GenAI Testing: Strategies for Ensuring Reliable Text-Based Applications

Friday, 01 March 2024

Testing generative AI applications requires creative thinking and is not the same as testing traditional software applications.  GenAI applications are available for text, images, audio and video.  This paper deals with text only; Large Language Models.  LLMs are now freely available to purchase; pre-trained then fine-tuned for their specific application and organisation.  To increase accuracy and avoid the well-known concept of GenAI hallucinations, LLMs are increasingly paired with RAG – Retrieval Augmented Generation, which is a technique to reference known additional data and provide context around prompts before they reach the LLM, to improve the quality of the output, but also to provide relevant data missing from the LLM, e.g. company specific or more up to date information.  RAG allows supporting context to be provided from specific and trusted data sources, so that the output from the LLM is grounded in fact, which helps avoid inconsistent and inaccurate output. 

One of the key principles of software testing, is that testing should be performed on a known version of a piece of software, including known versions of all its components; this is in addition to probably the most well-known (and relevant to GenAI) testing principle – that exhaustive testing is impossible!  Addressing these two elements are the key to unlocking quality testing for GenAI applications. It’s also important to note that testing text-based GenAI is testing a combination of the RAG contextual data, the LLM itself, Prompts, Model Temperature and other settings. 

Knowing the version number and content / corresponding behaviour of an LLM, supplemented by RAG or not, is very difficult.  LLM’s include vast amounts of information to start with, which is supplemented by the addition of RAG data, user feedback, changes to moderation layers, as well as model weighting and temperature settings.  The changing nature of LLM’s, coupled with the difficulties of nailing down a specific and known version of the test object mean that the behaviour of the LLM will change over time, a phenomenon known as drift. Drift can mean that the same questions posed to a model start to change fractionally over time.   These changes can include the introduction of bias and can, if left unchecked, result in models that show significant divergence from what was originally viewed as acceptable. 

These factors mean that as well as testing a model on implementation, that tests should be repeated on a regular basis, to monitor drift and ensure the model remains true to its original specification, promoting fairness and limiting bias.  The need for these regression tests make automation ideal, but there is still an elephant in the room; how do we test a model when the answers can and do vary, even though they may mean the same thing? 

There is another factor to consider; in more traditional software evaluation it helps to understand the context and reason for problems found – aka de-bugging, which helps developers quickly fix problems.  Investigating the source of identified problems in a GenAI application is obviously more difficult, bearing in mind the scale of data involved, as well as other factors including prompts, RAG and model settings. Tracking responses against the source of the information used is still possible, with the help of tools, as we will see later in this article. 


Testing GenAI Applications  

The following types of testing are relevant when implementing GenAI applications. 


Data Quality Testing 

To evaluate data quality with RAG enabled LLM’s, testing should be split into two parts.  Tests should be designed to test the database that is fulfilling the RAG functions first, to identify and remove any errors before moving on to testing the eventual output of the LLM, so that issues can be identified.  Testing the RAG database will be easier as it should contain application or organisation relevant, specific information which tests can be designed for.  Tests can also be repeated for different RAG databases to be compared, to evaluate those that perform the best, against the business uses of the app under test.  An example of an organisational implementation of this technology could be the implementation of a chatbot to provide answers on company procedures, HR policies, handbook etc. 

Once testing has established confidence that RAG is working as it should and is returning the correct information, testing can move on to including the LLM, from a known foundation of verified data.  Again, this can be carried out using different LLMs to answer the same questions if needed.   

Data quality analysis is a vital step that should include tests to verify key use cases of the application in its implementation – ideally answers that can be readily checked.  This is worth carrying out manually initially, but the questions and answers identified in this phase can be useful when designing automated regression tests. 


Functional / Adversarial and Bias Testing 

Functional Testing of the model and its prompts is important, also to test edge cases and boundary values, ensuring consistency and that small changes correctly change the output. This is an opportunity to test guardrails and boundaries, to see whether the model can be manipulated into outputting information that it should not, by design. 

As a part of testing the boundaries and measuring the response of the LLM to different inputs, bias and fairness testing should also be considered during this test phase.  This includes gender and racial stereotypes, tweaking prompts to amplify or reveal possible bias in these areas. 

Model temperature is an important setting, which controls the balance between random and creative output on the one hand, to more deterministic output on the other.  This should be factored into the testing of the inputs, to monitor the results.  The need to repeat these tests does increase the likelihood that test automation would be of benefit. 


Testing for Accuracy and Completeness – Regression Testing 

Regression testing is likely the primary way by which LLM’s need to be tested.  Compiling regression packs utilising test automation involves a significant investment, which should be re-couped if the pack is flexible and allows changes to be easily made as models evolve. 

Working out an optimal strategy for testing a LLM model, we can use the RAG principle to our advantage, as a definition of truth to drive out the right answers.  Our approach is to identify a series of questions that are based on the expected use of the system; the contents of the RAG and the LLM itself.  These will be questions that have factual answers, grounded in a benchmark truth that can be relatively easily assessed.  The steps involved in the test process are as follows, at a high level. 

 

  • Analysis: carry out a detailed analysis of the LLM, the RAG and the intended uses of the GenAI application or chatbot.  If considering a chatbot already in use, logs are a useful place to start to identify frequently asked question, but using RAG, chunks of information that represent key facts, or ‘benchmark truths’ can be identified and GenAI can be used to quickly generate key questions on these chunks.

  • Test Design: Establish a series of questions based on a benchmark truth established from chunks of data. GenAI itself can be used relatively easily to document a series of questions and model answers from a chunk of text on a particular subject.  Ideally a range of questions and answers should be compiled to represent key information, the more the better.  Though these are developed by GenAI it is important that these are carefully checked manually before their use.  So that the questions and answers can be used by an automation tool or test harness, they should be saved in a csv file or database and attached a date and a version number. 

  • Set Measurement Criteria: A series of scoring levels should then be defined for two variables – accuracy and completeness.  These should be defined in a range of 1-5 typically, with each level clearly defined.  These questions and answers will form a benchmark truth of the system and can be used in regression testing to monitor drift and to ensure the bias and fairness attributes present in the application initially remain in place. 

  • Implement Logging: using augmentation of the model, logging can be implemented within the LLM, to identify the chain of thought and the context on which the answer to the submitted questions has been identified.  This will help in de-bugging. 

  • Design Test Automation Harness: an automation tool will be used to drive the GenAI app or chatbot.  The tool could be as simple as a python script, or as complex as an enterprise automation tool.  The automation assets can then be added into a CI/CD tool so they can be run automatically.  The automation harness will use the LLM to compare the output of each answer to the benchmark truth.  The answers provided in the initial rounds should also be independently scored by a member of the team and then cross referenced against the scores allocated by the LLM.  Assuming there are no huge variances, the tests and scoring can be assumed as valid. 

  • Debug and Pre-Test: tests should be run against all questions repetitively, at least 20-30 times, so that it can be checked that the automation harness is working consistently. 

  • Regular Execution:  the tests can then form a part of a regression pack and be run on a regular basis to ensure that the model remains consistent over time, against important key scenarios. 

  • Maintenance: as new areas of frequent use or concern come to light, these can be added to the test pack.  Where there are tests that are no longer relevant, these can be retired. 

 

Security Testing 

Security testing does need to be considered, to avoid ‘prompt injection’ which seeks to manipulate prompts maliciously.  Testing is needed to ensure that prompts of this nature are unable to affect the application under test and that additional prompts are isolated accordingly. 


Performance Testing 

Where applications are planned to be used by large numbers of concurrent users and response times are important, performance tests should be carried out, to ensure applications can support the anticipated load – just as with traditional software applications. 


Crowd Testing 

Due to the nature of GenAI applications, testing by many users is likely to be an important part of a pyramid of tests needed to be confident in these apps, especially when considering exploratory testing and the ability to gain perspective of different social or ethnic groups. 


Assessing Context / Tools
 

When running tests and analysing results that do not meet expectations for the delivery of accurate, complete and non-biased responses, it is useful to be able to see where the LLM is getting its answers from.  This can be facilitated using a tool to provide insight into chain of thought and the references in the LLM used by the generative application to make its decisions.  Understanding this context is very useful and can be carried out using tools including: 

  • Watsonx.governance by IBM 
  • Langsmith by Langchain: create traces, version and comment on prompts, add labels and feedback on traces and construct datasets 
  • Giskard – ML Testing Library, AI Quality Hub, LLM Monitoring – open source and enterprise 

 

Other tools are coming onto the market all the time, Prolifics continue to review the market and select best of breed tooling to ensure maximum benefit for our customers.


Contact Us for a no obligation chat about how to tackle testing on your GenAI applications.

Jonathan Binks - Head of Delivery
Prolifics Testing UK

Scroll to top