Replication package for ICSE 2025 paper "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?".
- For evaluating GPT-3.5/4, there are no special hardware requirements.
- For evaluating open-source LLMs, the vLLM library requires capable NVIDIA/AMD GPUs. We use NVIDIA A800 GPUs in our experiments. Detailed hardware requirements can be found in vLLM's installation guide: https://docs.vllm.ai/en/latest/getting_started/installation.html
- OS: Linux
- Python: 3.8 - 3.11 (Conda environment is optional but recommended)
- Install the JSON processor `jq`.
- Install the necessary Python libraries by running `pip install -r requirements.txt`.
- Users without capable GPUs should instead run `pip install -r requirements-nogpu.txt`. In this case, they can only evaluate GPT models via the OpenAI API.
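For reference, a minimal setup sketch (the environment name `reval` is our placeholder; a Conda environment is optional, as noted above):

```bash
# Optional: create and activate an isolated Conda environment
conda create -n reval python=3.10 -y
conda activate reval

# Install dependencies; pick ONE of the two requirement files
pip install -r requirements.txt          # machines with a capable GPU
# pip install -r requirements-nogpu.txt  # no GPU: GPT models via OpenAI API only

# Install the jq JSON processor (Debian/Ubuntu shown; adapt to your distro)
sudo apt-get install -y jq
```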
Note:
- We do not distribute the open-source LLMs in this replication package due to license and file size issues. Users can download the model checkpoints from Hugging Face.
- The vLLM library supports a variety of model architectures, but the latest models may not be supported. For details, please refer to https://docs.vllm.ai/en/latest/models/supported_models.html.
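If one needs a way to fetch a checkpoint, the `huggingface_hub` CLI works; a sketch, where the model ID is only an example stand-in for an entry from `model_list.txt`:

```bash
# Download a model checkpoint into the local Hugging Face cache.
# Replace the example ID with a model listed in model_list.txt.
pip install -U huggingface_hub
huggingface-cli download codellama/CodeLlama-7b-Instruct-hf
```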
Important files:
- The `data` folder contains our benchmark data: the adapted dataset (from HumanEval and ClassEval) and the task inputs generated by `taskgen.py`.
- The `prompts` folder contains the prompt templates for the four tasks and two prompt types; `prompt.py` reads the templates and then constructs concrete prompts during evaluation.
- The `model_list.txt` file lists the IDs and URLs of the open-source LLMs we use in the experiments.
- The entry point of our framework is `evaluation.py`. Users can run it with the command `python evaluation.py <args>`, and its usage is as follows:
```
usage: evaluation.py [-h] [-i INPUT] [-o OUTPUT] [{config,run}]

Run evaluation for REval tasks

positional arguments:
  {config,run}          Command to run

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        specify the configuration file to load
  -o OUTPUT, --output OUTPUT
                        specify the configuration file to save
```
In short, users first create a config file with `python evaluation.py config` and then run the evaluation using that config file with `python evaluation.py run`.
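A typical two-step invocation might look like this (the file name `config.json` is our placeholder; pass whatever path you prefer):

```bash
# Step 1: create a configuration file and save it with -o
python evaluation.py config -o config.json

# Step 2: run the evaluation, loading that configuration with -i
python evaluation.py run -i config.json
```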
- `start_server.sh` is a Bash script that reads a config file created as above and starts a local OpenAI-compatible API server (powered by the vLLM library); see the sketch after this list.
- Check out `batch_run.py` if one wants to run the tasks concurrently (and repeat them multiple times).
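As a rough illustration of what such a server involves, vLLM ships an OpenAI-compatible entrypoint; the sketch below is not the script's actual logic, and the model ID and port are assumptions (`start_server.sh` presumably derives them from the config file):

```bash
# Start a local OpenAI-compatible API server with vLLM.
# Model ID and port are examples only, not values taken from start_server.sh.
python -m vllm.entrypoints.openai.api_server \
    --model codellama/CodeLlama-7b-Instruct-hf \
    --port 8000
```

On newer vLLM versions, `vllm serve <model-id>` is an equivalent command.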