Context Rot: How Increasing Input Tokens Impacts LLM Performance

This repository contains the toolkit for replicating results from our technical report.

Motivation

Large Language Models (LLMs) are typically presumed to process context uniformly: the model should handle the 10,000th token just as reliably as the 100th. In practice, this assumption does not hold. We observe that model performance varies significantly as input length changes, even on simple tasks.

[Figure: Latest models on the Repeated Words task]

Experiments

Our experiments are organized under the experiments/ folder:

1. NIAH Extension (`experiments/niah_extension/`)

Extends Needle in a Haystack (NIAH) to examine the effects of needles that match the question semantically rather than lexically, as well as the effects of varying the haystack content.
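As a rough illustration of the basic NIAH setup that this experiment extends, the sketch below places a needle sentence into haystack text at a chosen relative depth. The helper name `insert_needle`, the naive sentence splitting, and the example strings are assumptions made for illustration; this is not code from the repository, and the actual scripts may construct their inputs differently.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` into `haystack` at a relative depth in [0, 1].

    Hypothetical helper. The insertion point is snapped to a sentence
    boundary, and sentences are split naively on "." (so decimals or
    abbreviations in real text would need a proper sentence splitter).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [s.strip() for s in haystack.split(".") if s.strip()]
    # depth 0.0 puts the needle first, depth 1.0 puts it last.
    index = round(depth * len(sentences))
    sentences.insert(index, needle.strip().rstrip("."))
    return ". ".join(sentences) + "."
```

Sweeping `depth` across several values and growing the haystack gives one input per (length, position) pair, which is the usual shape of a NIAH evaluation grid.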

2. LongMemEval (`experiments/longmemeval/`)

Evaluates models on the LongMemEval task.

3. Repeated Words (`experiments/repeated_words/`)

Tests model performance on replicating a sequence of repeated words.
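One way to picture this task is as a two-step check: build a long sequence of a repeated word, then compare the model's output against it. The sketch below is a minimal illustration under stated assumptions. The names `build_sequence` and `score_replication` are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in similarity measure; the metric used in the report may differ.

```python
from difflib import SequenceMatcher


def build_sequence(word: str, count: int) -> str:
    """Return `word` repeated `count` times, separated by single spaces."""
    return " ".join([word] * count)


def score_replication(expected: str, output: str) -> float:
    """Word-level similarity in [0, 1] between expected and actual output.

    Illustrative scoring only: 1.0 means the output matches word for word,
    and lower values reflect missing, extra, or altered words.
    """
    expected_words = expected.split()
    output_words = output.split()
    matcher = SequenceMatcher(None, expected_words, output_words, autojunk=False)
    return matcher.ratio()
```

Scoring the same word at increasing values of `count` would show how replication quality changes with input length, which is the effect this experiment is meant to surface.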

Each experiment folder contains detailed instructions in its own README.md file.

Data

Datasets can be downloaded here.

Quick Start

1. Clone the repository.

2. Create and activate a virtual environment:

       python -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies: `pip install -r requirements.txt`

4. Set up environment variables:
   - OpenAI: `OPENAI_API_KEY`
   - Anthropic: `ANTHROPIC_API_KEY`
   - Google: `GOOGLE_APPLICATION_CREDENTIALS` and `GOOGLE_MODEL_PATH`

5. Navigate to the specific experiment folder and follow the instructions in its README.
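Before running an experiment, it can help to confirm that the environment variables are visible to Python. The following is a minimal sketch, not part of the repository: the variable names come from the list above, while the `REQUIRED_VARS` mapping and the `missing_variables` helper are hypothetical names introduced here.

```python
import os

# Variable names are taken from the Quick Start list; the grouping by
# provider and the helper below are illustrative assumptions.
REQUIRED_VARS = {
    "OpenAI": ["OPENAI_API_KEY"],
    "Anthropic": ["ANTHROPIC_API_KEY"],
    "Google": ["GOOGLE_APPLICATION_CREDENTIALS", "GOOGLE_MODEL_PATH"],
}


def missing_variables(env=None) -> dict:
    """Return {provider: [unset variable names]} for incomplete providers."""
    env = os.environ if env is None else env
    missing = {}
    for provider, names in REQUIRED_VARS.items():
        # Treat empty strings the same as unset variables.
        unset = [name for name in names if not env.get(name)]
        if unset:
            missing[provider] = unset
    return missing


if __name__ == "__main__":
    for provider, names in missing_variables().items():
        print(f"{provider}: missing {', '.join(names)}")
```

A provider that you do not plan to use can be ignored in the output; the check only reports which variables are unset.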

Citation

If you find this work useful, please cite our technical report:

@techreport{hong2025context,
  title = {Context Rot: How Increasing Input Tokens Impacts LLM Performance},
  author = {Hong, Kelly and Troynikov, Anton and Huber, Jeff},
  year = {2025},
  month = {July},
  institution = {Chroma},
  url = {https://research.trychroma.com/context-rot},
}

chroma-core/context-rot