VALTEST: Automated Validation of Language Model Generated Test Cases

This repository contains the implementation of VALTEST, a framework for automatically validating test cases generated by Large Language Models (LLMs). VALTEST aims to improve the reliability of LLM-generated test cases by leveraging token probabilities to predict test case validity. The framework evaluates validity, mutation score, and line coverage for test cases generated from three popular datasets (HumanEval, MBPP, and LeetCode) across three LLMs: GPT-4o, GPT-3.5-turbo, and Llama3.

Project Structure

  • main_train.py: The primary script for training machine learning models on extracted token probability features to predict test case validity. It handles feature extraction, model training, evaluation, and selection of valid test cases using various machine learning algorithms.

  • generate_testcases.py: A script that generates test cases using different LLMs. It interacts with the OpenAI and Hugging Face APIs to produce test cases and captures the token probabilities of the generated output.

  • curate_testcases.py: This script refines the generated test cases using a chain-of-thought approach: it prompts the LLM to re-check the assertions in each test case, corrects invalid ones, and runs a final evaluation on the curated test cases.

  • requirements.txt: Contains all dependencies required to run the project, including libraries for machine learning, deep learning, token probability extraction, and mutation testing.

Key Components

Feature Extraction

The framework extracts statistical features from the token probabilities of both the function inputs and the expected outputs in each test case. These features include the mean, max, min, sum, variance, and total token count, computed separately for the top-predicted and the second-predicted token probabilities.
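
As an illustration, the summary statistics for one group of token probabilities might be computed as follows (a minimal sketch; the feature names, token grouping, and exact statistics in main_train.py may differ):

import numpy as np

def probability_features(token_probs, prefix):
    """Summary statistics over the probabilities of one token group
    (e.g., the top-ranked or second-ranked candidates of the input tokens)."""
    p = np.asarray(token_probs, dtype=float)
    return {
        f"{prefix}_mean": p.mean(),
        f"{prefix}_max": p.max(),
        f"{prefix}_min": p.min(),
        f"{prefix}_sum": p.sum(),
        f"{prefix}_var": p.var(),
        f"{prefix}_count": len(p),
    }

# Example: features for the input tokens of one test case (values are made up).
features = {}
features.update(probability_features([0.98, 0.91, 0.99], "input_top"))
features.update(probability_features([0.01, 0.06, 0.004], "input_second"))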

Model Training and Evaluation

Machine learning models (e.g., logistic regression, random forest, and ensemble models) are trained on the extracted features to predict the validity of each test case from its token-probability distribution. K-fold cross-validation is used for training and evaluation across the different LLM-generated test suites.
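
A minimal sketch of this step, assuming scikit-learn and a pandas DataFrame of extracted features (the file name, column names, and model settings below are illustrative, not the repository's actual configuration):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# One row per generated test case; "valid" is the ground-truth label (0/1).
df = pd.read_csv("features_HumanEval_gpt-4o.csv")  # hypothetical file name
X, y = df.drop(columns=["valid"]), df["valid"]

model = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    voting="soft",
)

# Out-of-fold validity scores via K-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
df["validity_score"] = cross_val_predict(
    model, X, y, cv=cv, method="predict_proba"
)[:, 1]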

Test Case Generation

Test cases are generated by prompting LLMs using function signatures and task descriptions from datasets such as HumanEval, MBPP, and LeetCode. Token probabilities are captured during this process to extract features for model training.
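
With the OpenAI API, for example, per-token log probabilities can be requested alongside the completion. The sketch below assumes the openai Python package (v1+); the actual prompt construction and response parsing in generate_testcases.py may differ:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write Python assert statements that test the following function.\n\n"
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,    # return log probabilities for each generated token
    top_logprobs=2,   # also return the second-ranked candidate per token
)

choice = response.choices[0]
test_code = choice.message.content
# Each entry holds the chosen token, its logprob, and its top alternatives.
token_logprobs = [tok.logprob for tok in choice.logprobs.content]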

Test Case Curation

Invalid test cases are identified and corrected through a chain-of-thought reasoning process. This step ensures that test cases reflect the expected behavior of the functions, even if the LLM initially generated invalid assertions.
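
Conceptually, the curation step can be pictured as a repair loop like the sketch below; it is illustrative only, and extract_assertion and the llm callable stand in for the repository's actual prompting and parsing code:

import re

def extract_assertion(text):
    """Pull the last assert statement out of an LLM response (illustrative)."""
    matches = re.findall(r"^\s*assert .+$", text, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None

def curate(test_case, function_spec, llm):
    """Ask the LLM to reason step by step about the expected output and
    rewrite the assertion if the original one was invalid.
    llm is any callable that maps a prompt string to a response string."""
    prompt = (
        f"{function_spec}\n\n"
        f"Invalid test case: {test_case}\n"
        "Think step by step: trace the function on this input, state the "
        "expected output, then give a corrected assert statement."
    )
    reasoning = llm(prompt)
    corrected = extract_assertion(reasoning)
    return corrected if corrected else test_case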

Running the Project

Dependencies

Ensure that all required libraries are installed by running:

pip install -r requirements.txt

Generating Test Cases

To generate test cases from an LLM, use the generate_testcases.py script:

python generate_testcases.py --dataset HumanEval --llm gpt-4o

Supported datasets: MBPP, HumanEval, LeetCode

Supported LLMs: gpt-4o, gpt-3.5-turbo, llama3

Training the Model

To train a model and predict the validity of test cases, use the main_train.py script:

python main_train.py --dataset HumanEval --llm gpt-4o --mutation 0 --threshold 0.8 --topN 5 --features all

Parameters for main_train.py

The main_train.py script accepts several parameters to customize the execution of the test case validation process. Below is a description of each parameter and its use:

--dataset

  • Description: Specifies the dataset to use for generating and evaluating test cases.
  • Choices: MBPP, HumanEval, LeetCode
  • Required: Yes
  • Example:
    --dataset HumanEval

--llm

  • Description: Specifies the Large Language Model (LLM) to use for generating test cases.
  • Choices: gpt-4o, gpt-3.5-turbo, llama3
  • Required: Yes
  • Example:
    --llm gpt-4o

--mutation

  • Description: Enables mutation testing for the selected dataset and LLM. Mutation testing measures how well the test cases detect faults in the code.
  • Choices: 0 (disable), 1 (enable)
  • Default: 0 (disabled)
  • Example:
    --mutation 1

--threshold

  • Description: Specifies the threshold for selecting valid test cases, i.e., the minimum predicted validity score a test case must reach to be selected.
  • Choices: 0.5, 0.65, 0.7, 0.8, 0.85, 0.9
  • Default: 0.8
  • Example:
    --threshold 0.8

--topN

  • Description: Specifies the minimum number of test cases to keep per function. If fewer than N test cases meet the threshold, the N test cases with the highest predicted validity scores are selected instead (see the selection sketch after this parameter list).
  • Choices: 1, 3, 5, 7
  • Default: 5
  • Example:
    --topN 5

--features

  • Description: Specifies which feature sets to use for training the model. The feature sets can focus on the function input, the expected output, or both.
  • Choices: all, input, output
  • Default: all
  • Example:
    --features all
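
Taken together, --threshold and --topN govern per-function test selection roughly as sketched below (illustrative code; the actual selection logic in main_train.py may differ in its details):

def select_tests(scored_tests, threshold=0.8, top_n=5):
    """scored_tests: list of (test_case, validity_score) pairs for one function."""
    ranked = sorted(scored_tests, key=lambda t: t[1], reverse=True)
    above = [t for t in ranked if t[1] >= threshold]
    # Keep every test case above the threshold; if fewer than top_n qualify,
    # fall back to the top_n highest-scoring test cases instead.
    return above if len(above) >= top_n else ranked[:top_n]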

Curating Test Cases

To curate the generated test cases and re-evaluate them with mutation testing:

python curate_testcases.py --dataset HumanEval --llm gpt-4o
python main_train.py --dataset HumanEval --llm gpt-4o --mutation 1

Results and Evaluation

  • Validity Rate (VR): The proportion of generated test cases that pass when executed against the original (correct) source code.
  • Mutation Score (MS): The percentage of mutants killed during mutation testing, reflecting the fault-detection capability of the test cases.
  • Line Coverage (LC): The percentage of source code lines executed by the test cases.
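
For reference, the three metrics reduce to simple ratios; the helper functions below are illustrative, not part of the repository:

def validity_rate(n_valid, n_generated):
    return n_valid / n_generated             # VR: valid tests / generated tests

def mutation_score(n_killed, n_mutants):
    return n_killed / n_mutants              # MS: killed mutants / all mutants

def line_coverage(n_executed_lines, n_total_lines):
    return n_executed_lines / n_total_lines  # LC: executed lines / total lines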