[Paper]
HELMET is a comprehensive benchmark for long-context language models covering seven diverse categories of tasks. The datasets are application-centric and are designed to evaluate models at different lengths and levels of complexity. Please check out the paper for more details; this repo explains how to run the evaluation.
- HELMET Code
- HELMET data
- Correlation analysis notebook
- Retrieval setup
- VLLM Support
Please install the necessary packages with
pip install -r requirements.txt
Additionally, if you wish to use the API models, you will need to install the package corresponding to the API you wish to use:
pip install openai # OpenAI API
pip install anthropic # Anthropic API
pip install google-generativeai # Google GenerativeAI API
pip install together # Together API
You should also set the environment variables accordingly so the API calls can be made correctly. To see which variables you need to set, check out model_utils.py and the corresponding class (e.g., GeminiModel).
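If you want to double-check that the keys are in place before launching a run, a quick sketch like the following may help (the variable names here are assumptions; verify the exact names against the classes in model_utils.py):

```python
import os

# Assumed environment variable names for each provider; confirm the exact
# names against the corresponding classes in model_utils.py.
expected_keys = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google GenerativeAI": "GOOGLE_API_KEY",  # may instead be GEMINI_API_KEY
    "Together": "TOGETHER_API_KEY",
}

for provider, var in expected_keys.items():
    status = "set" if os.environ.get(var) else "MISSING"
    print(f"{provider}: {var} is {status}")
```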
Data will be uploaded soon :) In the meantime, please contact me to get access.
To run the evaluation, simply use one of the config files in the configs directory. You may also override any arguments in the config file or add new arguments directly from the command line:
python eval.py --config configs/cite.yaml --model_name_or_path {local model path or huggingface model name} --output_dir {output directory}
This will output the results file under the output directory.
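If you want to inspect the outputs programmatically, a minimal sketch like the one below can list whatever JSON files eval.py wrote; the file naming and schema depend on the task config, so treat this as illustrative:

```python
import glob
import json
import os

output_dir = "output"  # replace with the --output_dir you passed to eval.py

# Print each results file and its top-level structure; inspect the keys
# before relying on any particular schema.
for path in sorted(glob.glob(os.path.join(output_dir, "*.json"))):
    with open(path) as f:
        results = json.load(f)
    if isinstance(results, dict):
        print(path, "->", list(results.keys()))
    else:
        print(path, "->", f"list with {len(results)} entries")
```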
The full results from our evaluation are available here.
To run the model-based evaluation for LongQA and Summarization, please make sure that you have set the environment variables for OpenAI so calls can be made to GPT-4o. Then you can run:
python scripts/eval_gpt4_longqa.py
python scripts/eval_gpt4_summ.py
# Alternatively, if you want to shard the process
bash scripts/eval_gpt4_longqa.sh
bash scripts/eval_gpt4_summ.sh
To specify which models/paths you want to run model-based evaluation for, check out the Python scripts and modify the model_to_check field.
You may also use Claude, Gemini, or other models for model-based evaluation by modifying the corresponding class, but we have only tested gpt-4o-2024-05-13.
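Before kicking off the full model-based evaluation, it can be worth confirming that your OpenAI credentials and the judge model are reachable. A minimal sanity check, using the judge we tested (gpt-4o-2024-05-13), might look like:

```python
from openai import OpenAI  # requires `pip install openai` and OPENAI_API_KEY to be set

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```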
To add a new task/dataset, you just need to modify the data.py file.
Create a function that specifies how to load the data:
- Specify the string templates for the task through user_template, system_template, and prompt_template (which is usually just the concatenation of the two).
- Process each sample to fit the specified templates (the tokenization code will call user_template.format(**test_sample), and the same for system_template). Importantly, each sample should have a context field, which will be truncated automatically if the input is too long (e.g., for QA, this is the retrieved passages; for NarrativeQA, this is the book/script). You should use the question and answer fields to make evaluation/printing easier.
- Optionally, add a post_process function to process the model output (e.g., for MS MARCO, we use a ranking parse function; for RULER, we calculate the recall). There is also a default_post_process function that parses the output and calculates simple metrics like EM and F1, which you may use. This function should take in the model output and the test sample and return a tuple of (metrics, changed_output); the metrics (e.g., EM, ROUGE) are aggregated across all samples, and the changed_output is added to the test sample and saved to the output file.
- The function should return {'data': [list of data samples], 'prompt_template': prompt_template, 'user_template': user_template, 'system_template': system_template, 'post_process': [optional custom function]}.
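Putting these pieces together, here is a minimal sketch of what such a loader could look like. The dataset, file format, field names, and metric below are hypothetical; mirror the existing loaders in data.py for the exact conventions:

```python
import json

def load_my_task(path, max_test_samples=None):
    # Hypothetical loader for a JSONL file where each line already contains
    # "context", "question", and "answer" fields.
    user_template = "Use the given context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
    system_template = "Answer:"
    prompt_template = user_template + "\n\n" + system_template

    data = []
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            # Every field referenced by the templates must be present; the
            # "context" field is the one that gets truncated automatically.
            data.append({
                "context": sample["context"],
                "question": sample["question"],
                "answer": sample["answer"],
            })
    if max_test_samples is not None:
        data = data[:max_test_samples]

    def post_process(output, example):
        # Takes the model output and the test sample, returns (metrics, changed_output).
        # The exact type of `output` depends on the pipeline, so adapt as needed.
        prediction = str(output).strip()
        metrics = {"exact_match": float(prediction == example["answer"].strip())}
        return metrics, {"parsed_output": prediction}

    return {
        "data": data,
        "prompt_template": prompt_template,
        "user_template": user_template,
        "system_template": system_template,
        "post_process": post_process,  # optional; omit to fall back to default_post_process
    }
```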
Finally, simply add a new case to the load_data function that calls the function you just wrote to load your data.
You can refer to the existing tasks for examples (e.g., load_json_kv, load_narrativeqa, and load_msmarco_rerank).
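The dispatch itself is just one more branch. Assuming load_data switches on the dataset name (check its actual signature in data.py), the new case might look roughly like:

```python
# Sketch only: the real load_data in data.py has more arguments and cases.
def load_data(args, dataset, path=None):
    if dataset == "my_task":        # hypothetical dataset name
        return load_my_task(path)   # the loader sketched above
    # ... existing cases: load_json_kv, load_narrativeqa, load_msmarco_rerank, ...
    raise ValueError(f"Unknown dataset: {dataset}")
```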
The existing code supports using HuggingFace-supported models and API models (OpenAI, Anthropic, Google, and Together). To add a new model or use a different framework (e.g., VLLM), you can modify the model_utils.py file.
Specifically, you need to create a new class that implements the prepare_inputs (how the inputs are processed) and generate functions. Then, add a new case to load_LLM.
Please refer to the existing classes for examples.
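As a rough template, a new wrapper might look like the sketch below. The constructor arguments, the structure of the prepared inputs, and the return value of generate are all assumptions; mirror the existing classes in model_utils.py for the exact interface:

```python
class MyNewModel:
    """Hypothetical wrapper for a new inference framework; the existing
    classes in model_utils.py define the real interface to follow."""

    def __init__(self, model_name_or_path, temperature=0.0, max_new_tokens=1024, **kwargs):
        self.model_name = model_name_or_path
        self.temperature = temperature
        self.max_new_tokens = max_new_tokens
        # e.g., initialize your client or load your model/tokenizer here

    def prepare_inputs(self, test_sample, data):
        # Fill the task templates with the sample, and apply any model-specific
        # truncation or chat formatting here.
        return {
            "user": data["user_template"].format(**test_sample),
            "system": data["system_template"].format(**test_sample),
        }

    def generate(self, inputs=None, prompt=None, **kwargs):
        # Call your model/framework and return the generated text in whatever
        # structure the rest of the pipeline expects (see the existing classes).
        raise NotImplementedError
```

After that, add a case to load_LLM that returns this class when the appropriate model name or flag is given.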
We also analyze the correlation between performance on different datasets. The code will be released soon.
If you have any questions, please email me at hyen@cs.princeton.edu.
If you encounter any problems, you can also open an issue here. Please describe the problem in detail so we can help you better and more quickly!
If you find our work useful, please cite us:
@misc{yen2024helmetevaluatelongcontextlanguage,
title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly},
author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
year={2024},
eprint={2410.02694},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.02694},
}