Replication package for ICSE 2025 paper "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?".
- For evaluating GPT-3.5/4, there are no special hardware requirements.
- For evaluating open-source LLMs, the vLLM library requires capable NVIDIA/AMD GPUs. We use an NVIDIA A800 GPU in our experiments. Detailed hardware requirements can be found at vLLM's official website: https://docs.vllm.ai/en/latest/getting_started/installation.html
- OS: Linux
- Python: 3.10 (a Conda environment is optional but recommended)
- Install the JSON processor `jq`.
- Install the necessary Python libraries by running `pip install -r requirements.txt`.
- For compatibility, run `pip install --upgrade openai`.
- Users without capable GPUs should instead run `pip install -r requirements-nogpu.txt`. In this case, they can only evaluate GPT models via the OpenAI API.
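The installation steps above can be combined into a short setup sequence. The following is a minimal sketch, assuming an apt-based Linux distribution and a Conda environment named `reval` (both assumptions are illustrative):

```bash
# Environment setup sketch (package manager and environment name are illustrative)
sudo apt-get install -y jq                # JSON processor
conda create -n reval python=3.10 -y      # optional but recommended
conda activate reval
pip install -r requirements.txt           # use requirements-nogpu.txt on machines without capable GPUs
pip install --upgrade openai              # for OpenAI API compatibility
```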
Note:
- We do not distribute the open-source LLMs in this replication package due to license and file size issues. Users can download the model checkpoints from Hugging Face (see the example after this list).
- The vLLM library supports a variety of model architectures, but the latest models may not be supported. For details, please refer to https://docs.vllm.ai/en/latest/models/supported_models.html.
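As an example, a checkpoint can be fetched with the `huggingface_hub` command-line tool. The model ID and target directory below are illustrative; the actual model IDs are listed in `model_list.txt`:

```bash
# Download a model checkpoint from Hugging Face
# (the model ID and local directory are illustrative)
pip install -U "huggingface_hub[cli]"
huggingface-cli download codellama/CodeLlama-7b-Instruct-hf \
    --local-dir ./models/CodeLlama-7b-Instruct-hf
```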
Important files:
- The `data` folder contains our benchmark data: the adapted dataset (from HumanEval and ClassEval) and the task inputs generated by `taskgen.py`.
- The `prompts` folder contains the prompt templates for the four tasks and two prompt types. `prompt.py` reads the templates and constructs concrete prompts during evaluation.
- The `model_list.txt` file lists the IDs and URLs of the open-source LLMs we use in the experiments.
- The entry point of our framework is `evaluation.py`. Users can run it with `python evaluation.py <args>`; the usage of this command is as follows:
```
usage: evaluation.py [-h] [-i INPUT] [-o OUTPUT] [{config,run}]

Run evaluation for REval tasks

positional arguments:
  {config,run}          Command to run

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        specify the configuration file to load
  -o OUTPUT, --output OUTPUT
                        specify the configuration file to save
```
In short, users can create a config file with `python evaluation.py config` and then run the evaluation using that config file with `python evaluation.py run`.
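For example, a typical two-step workflow might look like the following; the config file name `config.json` is illustrative:

```bash
# Step 1: create and save a configuration file (file name is illustrative)
python evaluation.py config -o config.json

# Step 2: run the evaluation with the saved configuration
python evaluation.py run -i config.json
```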
- The `start_server.sh` Bash script reads a config file created as described above and starts a local OpenAI-compatible API server (powered by the vLLM library); a manual alternative is sketched below.
- Check out `batch_run.py` to run the tasks concurrently (and repeat them multiple times).
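If `start_server.sh` does not fit your environment, an OpenAI-compatible server can also be launched with vLLM directly. The following is a minimal sketch; the model path and port are illustrative, and newer vLLM releases expose the same functionality via `vllm serve`:

```bash
# Launch a local OpenAI-compatible API server with vLLM
# (model path and port are illustrative; pick a model from model_list.txt)
python -m vllm.entrypoints.openai.api_server \
    --model ./models/CodeLlama-7b-Instruct-hf \
    --port 8000
```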