REval

Replication package for ICSE 2025 paper "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?".

Prerequisites

Hardware Requirements

  • For evaluating GPT-3.5/4, there are no special hardware requirements.
  • For evaluating open-source LLMs, the vLLM library requires capable NVIDIA/AMD GPUs. We use an NVIDIA A800 GPU in our experiments. Detailed hardware requirements can be found on vLLM's official website: https://docs.vllm.ai/en/latest/getting_started/installation.html

Software Requirements

  • OS: Linux
  • Python: 3.10 (Conda environment is optional but recommended)
  • Install the JSON processor jq.
  • Install the necessary Python libraries by running pip install -r requirements.txt.
  • For compatibility, upgrade the OpenAI client by running pip install --upgrade openai.
  • Users without capable GPUs should instead run pip install -r requirements-nogpu.txt. In this case, only the GPT models can be evaluated via the OpenAI API. (A combined setup sketch follows this list.)
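For reference, a minimal setup might look like the following. The environment name is illustrative, and jq installation is shown for Debian/Ubuntu; adjust both to your platform:

    # create and activate an isolated Python 3.10 environment (name is illustrative)
    conda create -n reval python=3.10
    conda activate reval

    # install the jq JSON processor (Debian/Ubuntu example; use your platform's package manager)
    sudo apt-get install -y jq

    # install the Python dependencies (use requirements-nogpu.txt if no capable GPU is available)
    pip install -r requirements.txt
    pip install --upgrade openai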

Note:

  • We do not distribute the open-source LLMs in this replication package due to license and file size constraints. Users can download the model checkpoints from HuggingFace (see the sketch after this list).
  • The vLLM library supports a variety of model architectures, but the latest models may not be supported. For details, please refer to https://docs.vllm.ai/en/latest/models/supported_models.html.
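As an example, one way to fetch a checkpoint is with the huggingface_hub library. The repository ID and target directory below are illustrative only; see model_list.txt for the models we actually use:

    # sketch: download an open-source LLM checkpoint from HuggingFace
    # (repo_id and local_dir are illustrative, not prescribed by our scripts)
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="codellama/CodeLlama-7b-Instruct-hf",  # replace with a model from model_list.txt
        local_dir="checkpoints/CodeLlama-7b-Instruct-hf",
    )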

Usage

Important files:

  • The data folder contains our benchmark data: the adapted dataset (derived from HumanEval and ClassEval) and the task inputs generated by taskgen.py.
  • The prompts folder contains the prompt templates for four tasks and two prompt types. prompt.py reads the templates and then constructs concrete prompts during evaluation.
  • The model_list.txt file lists the IDs and URLs of the open-source LLMs we use in the experiments.
  • The entry point of our framework is evaluation.py. Users can run it with python evaluation.py <args>; the usage of this command is as follows:
usage: evaluation.py [-h] [-i INPUT] [-o OUTPUT] [{config,run}]

Run evaluation for REval tasks

positional arguments:
  {config,run}          Command to run

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        specify the configuration file to load
  -o OUTPUT, --output OUTPUT
                        specify the configuration file to save

In short, users create a configuration file with python evaluation.py config and then run the evaluation on that file with python evaluation.py run.
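For example, a typical session might look like this (the file name reval_config.json is illustrative):

    # generate a configuration file and save it
    python evaluation.py config -o reval_config.json

    # run the evaluation with the saved configuration
    python evaluation.py run -i reval_config.json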

  • start_server.sh is a Bash script that reads such a config file and starts a local OpenAI-compatible API server (powered by the vLLM library); see the sketch after this list for how such a server can be queried.
  • Check out batch_run.py to run the tasks concurrently (and repeat them multiple times).
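Once the local server is up, it can be queried with the standard OpenAI Python client. The snippet below is a minimal sketch, assuming the server listens on the default vLLM address http://localhost:8000/v1; the model ID and prompt are illustrative:

    # sketch: query the local OpenAI-compatible server started by start_server.sh
    # (base_url, model name, and prompt are assumptions for illustration)
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.chat.completions.create(
        model="codellama/CodeLlama-7b-Instruct-hf",  # replace with the served model ID
        messages=[{"role": "user", "content": "What is the value of x after line 3?"}],
        temperature=0.0,
    )
    print(response.choices[0].message.content)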