This repository contains code for our paper RULER: What's the Real Context Size of Your Long-Context Language Models. RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. We benchmark 17 open-source models across 4 task categories (13 tasks in total) in RULER, evaluating long-context capabilities beyond simple in-context recall. Here are our main results.
Models | Claimed Length | Effective Length | 4K | 8K | 16K | 32K | 64K | 128K | Avg. | wAvg. (inc) | wAvg. (dec) |
---|---|---|---|---|---|---|---|---|---|---|---|
Llama2 (7B) | 4K | | 85.6 | | | | | | | | |
Gemini-1.5-pro | 1M | >128K | 96.7 | 95.8 | 96.0 | 95.9 | 95.9 | 94.4 | 95.8 | 95.5 (1st) | 96.1 (1st) |
GPT-4-1106-preview | 128K | 64K | 96.6 | 96.3 | 95.2 | 93.2 | 87.0 | 81.2 | 91.6 | 89.0 (2nd) | 94.1 (2nd) |
Command-R-plus (104B) | 128K | 32K | 95.6 | 95.2 | 94.2 | 92.0 | 84.3 | 63.1 | 87.4 | 82.7 (7th) | 92.1 (3rd) |
GLM-4-chat (9B) | 1M | 64K | 94.7 | 92.8 | 92.1 | 89.9 | 86.7 | 83.1 | 89.9 | 88.0 (3rd) | 91.7 (4th) |
GradientAI/Llama3*(70B) | 1M | 32K | 95.2 | 93.4 | 93.4 | 89.4 | 82.6 | 72.0 | 87.7 | 84.0 (6th) | 91.3 (5th) |
Command-R (35B) | 128K | 32K | 93.8 | 93.3 | 92.4 | 89.5 | 84.9 | 76.0 | 88.3 | 85.5 (4th) | 91.1 (6th) |
Mixtral-8x22B (39B/141B) | 64K | 32K | 95.6 | 94.9 | 93.4 | 90.9 | 84.7 | 31.7 | 81.9 | 73.5 (8th) | 90.3 (7th) |
Yi (34B) | 200K | 32K | 93.3 | 92.2 | 91.3 | 87.5 | 83.2 | 77.3 | 87.5 | 84.8 (5th) | 90.1 (8th) |
Mixtral-8x7B (12.9B/46.7B) | 32K | 32K | 94.9 | 92.1 | 92.5 | 85.9 | 72.4 | 44.5 | 80.4 | 72.8 (9th) | 87.9 (9th) |
FILM-7B* (7B) | 32K | 32K | 92.8 | 88.2 | 88.1 | 86.9 | 70.1 | 27.1 | 75.5 | 66.4 (11th) | 84.7 (10th) |
Meta/Llama3* (RoPE θ = 16M) | 8K | >8K | 95.4 | 94.7 | 93.2 | 85.9 | 22.5 | 0.0 | 65.3 | 48.6 (14th) | 82.0 (11th) |
Mistral (7B) | 32K | 16K | 93.6 | 91.2 | 87.2 | 75.4 | 49.0 | 13.8 | 68.4 | 55.6 (15th) | 81.2 (12th) |
ChatGLM (6B) | 128K | 4K | 87.8 | 83.4 | 78.6 | 69.9 | 56.0 | 42.0 | 69.6 | 62.0 (13th) | 77.2 (13th) |
LWM (7B) | 1M | <4K | 82.3 | 78.4 | 73.7 | 69.1 | 68.1 | 65.0 | 72.8 | 69.9 (10th) | 75.7 (14th) |
Phi3 (3.8B) | 128K | 4K | 86.7 | 78.1 | 75.6 | 70.3 | 58.9 | 43.3 | 68.8 | 62.2 (12th) | 75.5 (15th) |
DBRX (36B/132B) | 32K | 8K | 95.1 | 93.8 | 83.6 | 63.1 | 2.4 | 0.0 | 56.3 | 38.0 (16th) | 74.7 (16th) |
Qwen (72B) | 32K | 8K | 94.9 | 93.8 | 78.0 | 67.8 | 0.0 | 0.0 | 55.7 | 37.5 (17th) | 74.0 (17th) |
Together (7B) | 32K | 4K | 88.2 | 81.1 | 69.4 | 63.0 | 0.0 | 0.0 | 50.3 | 33.8 (18th) | 66.7 (18th) |
LongChat (7B) | 32K | <4K | 84.7 | 79.9 | 70.8 | 59.3 | 0.0 | 0.0 | 49.1 | 33.1 (19th) | 65.2 (19th) |
LongAlpaca (13B) | 32K | <4K | 60.6 | 57.0 | 56.6 | 43.6 | 0.0 | 0.0 | 36.3 | 24.7 (20th) | 47.9 (20th) |
- Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, all models (except for Gemini-1.5-pro) exhibit large degradation on tasks in RULER as sequence length increases.
- While all models claim a context size of 32K tokens or greater (except for Llama3), only half of them can effectively handle a sequence length of 32K, judged by exceeding a qualitative threshold: Llama2-7B performance at 4K (85.6%). Performance exceeding the threshold is underlined.
- Almost all models fall below the threshold before reaching the claimed context lengths.
- Notes (Meta/Llama3)
  - The results are evaluated after changing `rope_theta` to 16M.
- Notes (FILM-7B)
- Notes (GradientAI/Llama3)
  - The results are submitted by the authors.
- Docker container: `docker pull cphsieh/ruler:0.1.0`
- The requirements are listed in `docker/Dockerfile` and `docker/requirements.txt`. Use the following commands to build the container based on NVIDIA's PyTorch container `nvcr.io/nvidia/pytorch:23.08-py3`.

```bash
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t cphsieh/ruler:0.1.0 .
```
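After the image is built, you can start an interactive container. The mount path and working directory below are only one possible setup and are not prescribed by the repository:

```bash
# Example only: run the image with GPU access and mount the RULER repo.
docker run --gpus all -it --rm \
    -v "$(pwd)/..":/workspace/RULER \
    -w /workspace/RULER \
    cphsieh/ruler:0.1.0 bash
```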
- Paul Graham essays for NIAH are downloaded from the NIAH GitHub repository and the Paul Graham blog.
- QA datasets are downloaded from SQuAD and HotpotQA.

```bash
cd scripts/data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh
```
- We download the models from Hugging Face (see the example download command after this list).
- The input template of each model is stored in `scripts/data/template.py`. Please add a new template if your model uses a different chat template.
- (Optional) If you are using TensorRT-LLM, please build your model engine based on their example scripts (e.g., Llama) with their Docker container.
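As a sketch of the download step, a model can be pulled into your model directory with `huggingface-cli`; the model name and target folder below are purely illustrative:

```bash
# Example only: download one Hugging Face model into MODEL_DIR (set in run.sh).
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 \
    --local-dir ${MODEL_DIR}/Mistral-7B-Instruct-v0.2
```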
- Setup `run.sh`

```bash
GPUS=""       # number of GPUs
ROOT_DIR=""   # the path that stores generated task samples and model predictions
MODEL_DIR=""  # the path that contains individual model folders from Hugging Face
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM
```
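For instance, a filled-in configuration might look like the following; all values are placeholders for your own environment:

```bash
# Illustrative values only; adjust to your setup.
GPUS="8"
ROOT_DIR="/workspace/ruler_benchmark"
MODEL_DIR="/workspace/models"
ENGINE_DIR="/workspace/trtllm_engines"
```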
- Setup `config_models.sh`

```bash
case $MODEL_NAME in
    YOUR_HF_MODEL_NAME)
        MODEL_PATH=${MODEL_DIR}/YOUR_MODEL_FOLDER
        MODEL_TEMPLATE_TYPE=""  # base, meta-chat, etc. defined in scripts/data/template.py
        MODEL_FRAMEWORK=""      # hf or vllm
        ;;
    YOUR_TRTLLM_ENGINE_NAME)
        MODEL_PATH=${ENGINE_DIR}/YOUR_ENGINE_FOLDER
        MODEL_TEMPLATE_TYPE=""  # base, meta-chat, etc. defined in scripts/data/template.py
        MODEL_FRAMEWORK="trtllm"
        ;;
    YOUR_OPENAI_MODEL_NAME)
        MODEL_PATH=""           # OpenAI model name listed in https://platform.openai.com/docs/models/
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="openai"
        TOKENIZER_PATH="cl100k_base"
        TOKENIZER_TYPE="openai"
        OPENAI_API_KEY=""       # your OpenAI API key
        ;;
    YOUR_GEMINI_MODEL_NAME)
        MODEL_PATH=""           # Gemini model name listed in https://ai.google.dev/gemini-api/docs/models/gemini
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="gemini"
        TOKENIZER_PATH=$MODEL_PATH
        TOKENIZER_TYPE="gemini"
        GEMINI_API_KEY=""       # your Gemini API key
        ;;
```
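As a concrete illustration, an entry for a local Hugging Face chat model served with vLLM might look like the following. The case name, folder name, and template choice are hypothetical; they must match the folder you downloaded and a template defined in `scripts/data/template.py`:

```bash
# Hypothetical entry to add inside the case statement in config_models.sh:
# a local Llama-3 chat checkpoint evaluated with vLLM.
llama3-8b-chat)
    MODEL_PATH=${MODEL_DIR}/Meta-Llama-3-8B-Instruct
    MODEL_TEMPLATE_TYPE="meta-chat"  # must be a template defined in scripts/data/template.py
    MODEL_FRAMEWORK="vllm"
    ;;
```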
- Start the evaluation based on our default `synthetic` benchmark:

```bash
bash run.sh YOUR_MODEL_NAME synthetic
```

The tasks to be evaluated are stored in `scripts/config_tasks.sh`. The configuration of each task is defined in `scripts/synthetic.yaml`. The complexity of each task can be configured by changing the arguments, which we describe in detail below.
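As a concrete example, with the hypothetical `llama3-8b-chat` entry sketched above, the full pipeline is launched with the command below; generated task samples and model predictions are written under `ROOT_DIR`.

```bash
bash run.sh llama3-8b-chat synthetic
```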
Category | Task name | Configurations |
---|---|---|
Retrieval | niah | `type_haystack`: repeat/essay/needle <br> # repeat: repeated noise sentences <br> # essay: Paul Graham essays <br> # needle: distractor needles <br> `type_needle_k`: words/numbers/uuids <br> `type_needle_v`: words/numbers/uuids <br> # words: adjective-noun <br> # numbers: 7 digits <br> # uuids: 32 digits <br> `num_needle_k`: int >= 1 # add multiple needles to the haystack <br> `num_needle_v`: int >= 1 # retrieve multiple values from a single key <br> `num_needle_q`: int >= 1 # retrieve multiple values from multiple keys |
Multi-hop Tracing | variable_tracking | `num_chains`: int >= 1 # number of variable name-binding chains <br> `num_hops`: int >= 1 # number of name-binding hops in each chain |
Aggregation | common_words_extraction | `freq_cw`: int >= 1 # frequency of common words <br> `freq_ucw`: int >= 1 # frequency of uncommon words <br> `num_cw`: int >= 1 # number of common words |
Aggregation | freq_words_extraction | `alpha`: float > 1.0 # parameter of the distribution used to draw synthetic words; reducing `alpha` increases the difficulty of this task. Increasing the number of words to return also increases the difficulty; we use 3 in our evaluations because models perform worse at short context sizes when more words need to be returned. |
Question Answering | qa | `dataset`: squad or hotpotqa # the short-context QA dataset we use |
- Add the basic arguments (required) and complexity configurations in the Python script.
- Verify the script is reproducible given a tokenizer, a sequence length, and a random seed.
- Save the script under the folder `scripts/data/synthetic`.
- Add `template` and `tokens_to_generate` in `scripts/data/synthetic/constants.py`.
- Add `answer_prefix` to prevent the model from refusing to answer.
- Add the automatic metric to evaluate your task in `scripts/eval/synthetic/constants.py`.
- Define your task name and complexity configurations in `scripts/synthetic.yaml`.
- Add your task name in `scripts/config_tasks.sh`.
While tasks in RULER are designed to be configurable, we only evaluate the above models with 13 task configurations. These tasks were selected because most models can achieve good (some almost perfect) performance at short context sizes (<= 4K), which leaves ample room to observe degradation as we extend the input length. We did not include in RULER more complex tasks on which models show worse performance at short context sizes, nor did we stress test every model with more difficult task configurations. Although RULER covers four task categories, extends previous evaluation protocols, and provides a clean test bed for sanity-checking LMs with known upper-bound performance, it is by no means comprehensive, and it cannot replace realistic tasks, which remain preferable. We welcome contributions of new tasks and/or new task categories to help evaluate long-context capabilities.
```bibtex
@article{hsieh2024ruler,
  title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
  author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
  journal={arXiv preprint arXiv:2404.06654},
  year={2024}
}
```
Disclaimer: This project is strictly for research purposes, and not an official product from NVIDIA.