HELMET: How to Evaluate Long-context Language Models Effectively and Thoroughly

[Paper]

HELMET is a comprehensive benchmark for long-context language models covering seven diverse categories of tasks. The datasets are application-centric and are designed to evaluate models at different lengths and levels of complexity. Please see the paper for more details; this repo describes how to run the evaluation.

Setup

Please install the necessary packages with:

pip install -r requirements.txt

Additionally, if you wish to use the API models, install the package for each API you plan to use:

pip install openai # OpenAI API
pip install anthropic # Anthropic API
pip install google-generativeai # Google GenerativeAI API
pip install together # Together API

You should also set the environment variables so that the API calls can be made correctly. To see which variables to set, check the corresponding class in model_utils.py (e.g., GeminiModel).
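As a quick sanity check before launching a run, the sketch below reports which API keys are missing. The variable names are an assumption: they are the names conventionally used with these client libraries, not names confirmed from model_utils.py, so adjust them to whatever the corresponding class actually reads. The helper missing_keys is hypothetical and not part of this repo.

```python
import os

# Assumed variable names (conventional for these client libraries);
# model_utils.py may read different ones.
ASSUMED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "together": "TOGETHER_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the assumed variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [ASSUMED_KEYS[p] for p in providers if not env.get(ASSUMED_KEYS[p])]


if __name__ == "__main__":
    # List only the providers you plan to call.
    for name in missing_keys(["openai", "anthropic"]):
        print(f"Please set {name} before running the evaluation.")
```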

Data

Data will be uploaded soon :) In the meantime, please contact me to get access.

Running evaluation

To run the evaluation, use one of the config files in the configs directory. You may also override any argument in the config file, or add new arguments, through the command line:

python eval.py --config configs/cite.yaml --model_name_or_path {local model path or huggingface model name} --output_dir {output directory}

This will output the results file under the output directory.
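To illustrate the override behavior, here is a minimal sketch in which values given on the command line take precedence over values from the config file. It is not the actual logic in eval.py: merge_config is a hypothetical helper, the config is passed in as a plain dict instead of being parsed from YAML, and --max_test_samples is an illustrative argument name.

```python
import argparse


def merge_config(config, argv):
    """Overlay command-line values on config-file values (command line wins)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path")
    parser.add_argument("--output_dir")
    parser.add_argument("--max_test_samples", type=int)  # hypothetical argument
    args = parser.parse_args(argv)

    merged = dict(config)
    # Only arguments that were actually passed replace the config values.
    merged.update({k: v for k, v in vars(args).items() if v is not None})
    return merged


if __name__ == "__main__":
    from_file = {"output_dir": "output", "max_test_samples": 100}
    print(merge_config(from_file, ["--output_dir", "runs/a"]))
```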

The full results from our evaluation are here.

Model-based evaluation

To run the model-based evaluation for LongQA and Summarization, make sure that you have set the environment variables for OpenAI so that you can make calls to GPT-4o. Then run:

python scripts/eval_gpt4_longqa.py
python scripts/eval_gpt4_summ.py

# Alternatively, if you want to shard the process
bash scripts/eval_gpt4_longqa.sh
bash scripts/eval_gpt4_summ.sh

To specify which models/paths to run model-based evaluation for, modify the model_to_check field in the Python scripts. You may also use Claude, Gemini, or other models for model-based evaluation by modifying the class, but we have only tested gpt-4o-2024-05-13.

Adding new tasks

To add a new task/dataset, you just need to modify the data.py file:

Create a function that specifies how to load the data:

Specify the string templates for the task through user_template, system_template, and prompt_template (which is usually just the concatenation of the two).
Process each sample to fit the specified templates (the tokenization code will call user_template.format(**test_sample), and the same for system_template). Importantly, each sample should have a context field, which will be truncated automatically if the input is too long (e.g., for QA, this is the retrieved passages; for NarrativeQA, this is the book/script). You should also use the question and answer fields to make evaluation and printing easier.
Optionally, add a post_process function to process the model output (e.g., for MS MARCO, we use a ranking parse function; for RULER, we calculate the recall). There is also a default_post_process function that parses the output and calculates simple metrics like EM and F1, which you may use. This function should take in the model output and the test sample and return a tuple of (metrics, changed_output). The metrics (e.g., EM, ROUGE) are aggregated across all samples, and the changed_output is added to the test_sample and saved to the output file.
The function should return {'data': [list of data samples], 'prompt_template': prompt_template, 'user_template': user_template, 'system_template': system_template, 'post_process': [optional custom function]}.

Finally, simply add a new case to the load_data function that calls the function that you just wrote to load your data. You can refer to the existing tasks for examples (e.g., load_json_kv, load_narrativeqa, and load_msmarco_rerank).
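The steps above can be sketched as follows. This is an illustrative outline, not code from data.py: load_toy_qa, load_data_sketch, the template strings, and the metric name are all hypothetical, and the sketch assumes the model output reaches post_process as a plain string, which may differ from what the evaluation code actually passes.

```python
def load_toy_qa():
    """Hypothetical loader that follows the return format described above."""
    user_template = "{context}\n\nQuestion: {question}"
    system_template = "Answer:"
    prompt_template = user_template + "\n" + system_template

    # Each sample carries a context field (the part that gets truncated),
    # plus question and answer fields for evaluation and printing.
    data = [
        {
            "context": "Paris is the capital of France.",
            "question": "What is the capital of France?",
            "answer": "Paris",
        },
    ]

    def post_process(output, example):
        # Assumption: output is the raw generated string.
        prediction = output.strip()
        exact = float(prediction.lower() == example["answer"].strip().lower())
        return {"exact_match": exact}, {"parsed_output": prediction}

    return {
        "data": data,
        "prompt_template": prompt_template,
        "user_template": user_template,
        "system_template": system_template,
        "post_process": post_process,
    }


def load_data_sketch(dataset):
    """Hypothetical dispatcher showing where the new case would go."""
    if dataset == "toy_qa":
        return load_toy_qa()
    raise ValueError(f"Unknown dataset: {dataset}")
```

For real loaders, the existing functions named above remain the authoritative reference for argument names and output formats.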

Adding new models

The existing code supports HuggingFace-supported models and API models (OpenAI, Anthropic, Google, and Together). To add a new model or use a different framework (e.g., vLLM), you can modify the model_utils.py file. Specifically, you need to create a new class that implements the prepare_inputs (how the inputs are processed) and generate functions. Then, you can add a new case to load_LLM. Please refer to the existing classes for examples.
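A rough outline of such a class is sketched below. The method signatures, the dictionary keys, and the load_llm_sketch dispatcher are assumptions made for illustration, and EchoModel is a toy stand-in that calls no real model; check the existing classes in model_utils.py for the actual interface before adapting it.

```python
class EchoModel:
    """Hypothetical stand-in that returns the last few words of the prompt."""

    def __init__(self, model_name, generation_max_length=2):
        self.model_name = model_name
        self.generation_max_length = generation_max_length

    def prepare_inputs(self, test_item, data):
        # Fill the task's prompt template with the fields of one sample.
        prompt = data["prompt_template"].format(**test_item)
        return {"prompt": prompt}

    def generate(self, inputs):
        # A real class would call the model or API here.
        words = inputs["prompt"].split()
        return {"output": " ".join(words[-self.generation_max_length:])}


def load_llm_sketch(model_name, **kwargs):
    """Hypothetical dispatcher showing where the new case would go."""
    if model_name.startswith("echo"):
        return EchoModel(model_name, **kwargs)
    raise ValueError(f"Unknown model: {model_name}")
```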

Dataset correlation analysis

We also analyze the correlation between performance on different datasets. The code will be released soon.

Contacts

If you have any questions, please email me at hyen@cs.princeton.edu. If you encounter any problems, you can also open an issue here. Please describe the problem in detail so we can help you better and faster!

Citation

If you find our work useful, please cite us:

@misc{yen2024helmetevaluatelongcontextlanguage,
      title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly}, 
      author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
      year={2024},
      eprint={2410.02694},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.02694}, 
}
