This repository contains the implementation of VALTEST, a framework for automatically validating test cases generated by Large Language Models (LLMs). The goal of VALTEST is to improve the reliability of LLM-generated test cases by leveraging token probabilities to predict test case validity. The framework evaluates validity, mutation score, and line coverage for test cases generated from three popular datasets (HumanEval, MBPP, and LeetCode) across three LLMs: GPT-4o, GPT-3.5-turbo, and Llama 3.
- `main_train.py`: The primary script for training machine learning models on extracted token-probability features to predict test case validity. It handles feature extraction, model training, evaluation, and selection of valid test cases using several machine learning algorithms.
- `generate_testcases.py`: Generates test cases with different LLMs. It interacts with APIs such as OpenAI and Hugging Face and captures token probabilities during generation.
- `curate_testcases.py`: Refines and validates the generated test cases using a chain-of-thought approach. It validates assertions within test cases by interacting with LLMs and runs final evaluations on the curated test cases.
- `requirements.txt`: Lists all dependencies required to run the project, including libraries for machine learning, deep learning, token-probability extraction, and mutation testing.
The framework extracts statistical features from the token probabilities of both the function inputs and the expected outputs. These features include the mean, maximum, minimum, sum, variance, and token count, computed over both the top-predicted token probabilities and the second-most-likely token probabilities.
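As a rough sketch of what such features look like, the snippet below computes the listed statistics over a list of token probabilities. The function and feature names are illustrative, not the exact ones used in `main_train.py`:

```python
import statistics

def probability_features(token_probs, prefix):
    """Summary statistics over a list of token probabilities (illustrative)."""
    if not token_probs:
        return {f"{prefix}_{name}": 0.0
                for name in ("mean", "max", "min", "sum", "var", "count")}
    return {
        f"{prefix}_mean": statistics.mean(token_probs),
        f"{prefix}_max": max(token_probs),
        f"{prefix}_min": min(token_probs),
        f"{prefix}_sum": sum(token_probs),
        f"{prefix}_var": statistics.variance(token_probs) if len(token_probs) > 1 else 0.0,
        f"{prefix}_count": len(token_probs),
    }

# Example: features for the input tokens' top-1 probabilities and the
# corresponding second-choice (top-2) probabilities.
top1 = [0.98, 0.91, 0.73]
top2 = [0.01, 0.05, 0.20]
features = {**probability_features(top1, "input_top1"),
            **probability_features(top2, "input_top2")}
```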
Machine learning models (e.g., logistic regression, random forest, ensemble models) are trained on the extracted features and predict the validity of each test case from its token-probability profile. K-fold cross-validation is used for training and testing across the different LLM-generated test suites.
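A minimal sketch of this training setup, assuming a scikit-learn style pipeline; the fold count, estimators, placeholder data, and threshold below are illustrative rather than the exact configuration in `main_train.py`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# X: one row of token-probability features per test case (placeholder data),
# y: 1 if the test case turned out to be valid, 0 otherwise.
X = np.random.rand(200, 24)
y = np.random.randint(0, 2, 200)

# Soft-voting ensemble of a logistic regression and a random forest.
model = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="soft",
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Out-of-fold probability that each test case is valid.
valid_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
selected = valid_proba >= 0.8  # e.g., the --threshold value
```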
Test cases are generated by prompting LLMs using function signatures and task descriptions from datasets such as HumanEval, MBPP, and LeetCode. Token probabilities are captured during this process to extract features for model training.
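For illustration, token probabilities can be captured through the OpenAI chat completions API's `logprobs` option. The prompt and the `add` function below are made up for the example, and the prompts actually used by `generate_testcases.py` may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write Python assert-based test cases for the following function.\n\n"
    "def add(a: int, b: int) -> int:\n"
    '    """Return the sum of a and b."""\n'
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,   # return the log probability of each generated token
    top_logprobs=2,  # also return the second most likely token at each step
)

test_code = response.choices[0].message.content
token_info = response.choices[0].logprobs.content
# Each entry exposes .token, .logprob, and .top_logprobs (the top-2
# alternatives), which is the raw material for the statistical features
# described above.
```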
Invalid test cases are identified and corrected through a chain-of-thought reasoning process. This step ensures that test cases reflect the expected behavior of the functions, even if the LLM initially generated invalid assertions.
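A hedged sketch of what such a curation step might look like; `build_curation_prompt`, the prompt wording, and the example function are hypothetical and not the exact prompts used by `curate_testcases.py`:

```python
# Hypothetical helper: ask the LLM to reason step by step about one assertion
# and emit a corrected version if the expected value is wrong.

def build_curation_prompt(function_code: str, assertion: str) -> str:
    return (
        "Here is a Python function:\n\n"
        f"{function_code}\n\n"
        f"Consider the test assertion:\n{assertion}\n\n"
        "Think step by step: compute the actual output of the function for "
        "this input, compare it with the expected value in the assertion, "
        "and then output a corrected assertion if the expected value is wrong."
    )

prompt = build_curation_prompt(
    "def is_even(n):\n    return n % 2 == 0",
    "assert is_even(3) == True",
)
# The LLM's step-by-step answer would end with the corrected assertion:
#   assert is_even(3) == False
```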
Ensure that all required libraries are installed by running:

```bash
pip install -r requirements.txt
```
To generate test cases with an LLM, use the `generate_testcases.py` script:

```bash
python generate_testcases.py --dataset HumanEval --llm gpt-4o
```

Supported datasets: `MBPP`, `HumanEval`, `LeetCode`

Supported LLMs: `gpt-4o`, `gpt-3.5-turbo`, `llama3`
To train a model and predict the validity of test cases, use the `main_train.py` script:

```bash
python main_train.py --dataset HumanEval --llm gpt-4o --mutation 0 --threshold 0.8 --topN 5 --features all
```

The `main_train.py` script accepts several parameters that customize the test case validation process. Each parameter is described below:
`--dataset`
- Description: Specifies the dataset to use for generating and evaluating test cases.
- Choices: `MBPP`, `HumanEval`, `LeetCode`
- Required: Yes
- Example: `--dataset HumanEval`
`--llm`
- Description: Specifies the Large Language Model (LLM) to use for generating test cases.
- Choices: `gpt-4o`, `gpt-3.5-turbo`, `llama3`
- Required: Yes
- Example: `--llm gpt-4o`
`--mutation`
- Description: Enables mutation testing for the selected dataset and LLM. Mutation testing measures how well the test cases detect faults in the code.
- Choices: `0` (disable), `1` (enable)
- Default: `0` (disabled)
- Example: `--mutation 1`
`--threshold`
- Description: Specifies the threshold for selecting valid test cases. The threshold defines the minimum predicted probability for a test case to be considered valid.
- Choices: `0.5`, `0.65`, `0.7`, `0.8`, `0.85`, `0.9`
- Default: `0.8`
- Example: `--threshold 0.8`
`--topN`
- Description: Specifies the number of top test cases to select per function. If fewer than `N` test cases meet the threshold, the top `N` cases are selected based on their probability scores (see the selection sketch after this parameter list).
- Choices: `1`, `3`, `5`, `7`
- Default: `5`
- Example: `--topN 5`
`--features`
- Description: Specifies which feature sets to use for training the model. The feature sets can focus on the function input, the expected output, or both.
- Choices: `all`, `input`, `output`
- Default: `all`
- Example: `--features all`
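The interaction between `--threshold` and `--topN` can be pictured with the sketch below. The function name and data layout are assumptions, and the actual selection logic in `main_train.py` may differ in detail:

```python
def select_test_cases(scored_tests, threshold=0.8, top_n=5):
    """One reading of the rule: keep every test case whose predicted validity
    probability meets the threshold; if fewer than top_n do, fall back to the
    top_n highest-scoring cases. scored_tests is a list of
    (test_case, predicted_validity_probability) pairs.
    """
    ranked = sorted(scored_tests, key=lambda t: t[1], reverse=True)
    above = [t for t in ranked if t[1] >= threshold]
    return above if len(above) >= top_n else ranked[:top_n]

tests = [("assert f(1) == 2", 0.95),
         ("assert f(2) == 3", 0.60),
         ("assert f(3) == 5", 0.40)]
print(select_test_cases(tests, threshold=0.8, top_n=2))
# -> the two highest-scoring cases, since only one meets the threshold
```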
To validate and curate the generated test cases:

```bash
python curate_testcases.py --dataset HumanEval --llm gpt-4o
```

To run mutation testing on the selected test cases:

```bash
python main_train.py --dataset HumanEval --llm gpt-4o --mutation 1
```
- Validity Rate (VR): The proportion of generated test cases that are valid when executed against the source code.
- Mutation Score (MS): The percentage of mutants killed during mutation testing, reflecting the fault-detection capability of the test cases.
- Line Coverage (LC): The proportion of source code lines executed by the test cases.
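In standard terms, these metrics correspond to the ratios below (exact denominators, e.g. whether equivalent mutants are excluded, depend on the mutation tool and coverage setup used):

$$
\text{VR} = \frac{\#\,\text{valid test cases}}{\#\,\text{generated test cases}}, \qquad
\text{MS} = \frac{\#\,\text{killed mutants}}{\#\,\text{generated mutants}}, \qquad
\text{LC} = \frac{\#\,\text{executed lines}}{\#\,\text{total lines}}
$$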