An evaluation pipeline for autoregressive language models using direct probability measurement for minimal pairs.

Primary language: Jupyter Notebook. License: MIT.

Minimal Pairs Eval Pipeline

This pipeline evaluates language models by reading out the conditional log probabilities for minimal pairs of sentences. In each pair, one sentence is considered correct, while the other contains a minimal violation. The model is expected to assign a lower probability to the incorrect sentence.
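
As a minimal sketch of the comparison described above: a sentence's log probability is the sum of the conditional log probabilities of its tokens, and the pair counts as correct when the acceptable sentence scores higher. The function names, the `next_token_prob` interface, and the toy bigram table are illustrative assumptions, not the pipeline's actual code. A real run would take the per-token probabilities from the language model.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: returns P(token | preceding tokens) for some model.
NextTokenProb = Callable[[Sequence[str], str], float]


def sentence_logprob(tokens: Sequence[str], next_token_prob: NextTokenProb) -> float:
    """Chain rule: log P(w_1..w_n) = sum over i of log P(w_i | w_1..w_{i-1})."""
    return sum(
        math.log(next_token_prob(tokens[:i], tokens[i]))
        for i in range(len(tokens))
    )


def prefers_correct(correct, incorrect, next_token_prob: NextTokenProb) -> bool:
    """True if the model assigns the higher log probability to the correct sentence."""
    return sentence_logprob(correct, next_token_prob) > sentence_logprob(
        incorrect, next_token_prob
    )


# Toy bigram table standing in for a real language model (made-up numbers).
TOY_BIGRAMS = {
    ("<s>", "the"): 0.5,
    ("the", "dogs"): 0.2,
    ("dogs", "bark"): 0.3,
    ("dogs", "barks"): 0.01,
}


def toy_model(context: Sequence[str], token: str) -> float:
    prev = context[-1] if context else "<s>"
    # Small floor probability for unseen bigrams so that log() stays defined.
    return TOY_BIGRAMS.get((prev, token), 1e-6)


good = ["the", "dogs", "bark"]
bad = ["the", "dogs", "barks"]
print(prefers_correct(good, bad, toy_model))
```

Both sentences of a pair share every token except the one carrying the violation, so the difference between the two scores comes from that token and whatever follows it.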

With enough test items targeting a specific linguistic phenomenon, the model's accuracy (the share of pairs in which it assigns the higher probability to the correct sentence) indicates how well it has acquired that phenomenon. Evaluating a model at different training checkpoints additionally shows how this ability develops over the course of training.
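
Accuracy per phenomenon could be computed along the following lines. This is a sketch that assumes each test item has already been reduced to a phenomenon label and two sentence log probabilities. The record layout, the phenomenon names, and the scores are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_phenomenon(items):
    """items: iterable of (phenomenon, logprob_correct, logprob_incorrect)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for phenomenon, lp_correct, lp_incorrect in items:
        totals[phenomenon] += 1
        # An item counts as a hit when the correct sentence is more probable.
        hits[phenomenon] += int(lp_correct > lp_incorrect)
    return {p: hits[p] / totals[p] for p in totals}


# Illustrative scores only, not real model output.
scores = [
    ("agreement", -12.1, -15.4),
    ("agreement", -9.8, -9.2),
    ("negation", -20.0, -23.5),
]
print(accuracy_by_phenomenon(scores))
```

Running the same computation on the scores from several checkpoints gives one accuracy value per phenomenon and checkpoint, which can then be plotted against training steps.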

Overview

Models

AI2-OLMo            | EleutherAI-Pythia
Hugging Face Suite  | Hugging Face Suite
GitHub              | GitHub
Technical Report    | Technical Report
Website             | Website

Both model families were released in several parameter sizes, each with intermediate training checkpoints (revisions). This makes it possible to test for emerging capabilities across parameter scale and training time.

Setup

  • Tested on Python 3.10.x, 3.11.x, and 3.12.x
  • Requires a GPU with support for CUDA >= 12.1 (smaller models can run on CPU, but this is not recommended)

venv

uv venv
# macOS / Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
uv pip install -r requirements.txt

conda

conda env create -f environment.yml
conda activate pipe

Datasets for evaluation

An example dataset for testing can be found in the data folder. Additional datasets can easily be integrated and tested. Please refer to the corresponding README.md in the folder for more details.

Running experiments

Run the Python script and specify the dataset, the model, and optionally a revision (defaults to main, which is the final checkpoint for all models).

To access intermediate training checkpoints (revisions), open the Pythia or OLMo suite on Hugging Face, go to a model's "Files and versions" tab, and pick the branch that corresponds to the desired checkpoint.
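
For orientation, the sketch below lists the branch names the Pythia suite documents for its checkpoints (`step0`, ten log-spaced steps from `step1` to `step512`, then every 1000 steps up to `step143000`) and shows how a branch could be selected through the `revision` argument of `from_pretrained` in the Hugging Face `transformers` library. The helper names are hypothetical, and it is an assumption that this mirrors how the pipeline loads models. OLMo branches follow a different naming scheme, so take those names from the model page.

```python
def pythia_revisions():
    """Branch names as documented for the Pythia suite (verify per model)."""
    log_spaced = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
    linear = list(range(1000, 143001, 1000))      # 1000, 2000, ..., 143000
    return [f"step{s}" for s in [0, *log_spaced, *linear]]


def load_checkpoint(model_name, revision="main"):
    """Hypothetical helper: load tokenizer and model at a given branch."""
    # Imported lazily so that listing revisions does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
    return tokenizer, model


revisions = pythia_revisions()
print(len(revisions), revisions[:3], revisions[-1])
```

A name from this list can be passed as the optional third argument of the script, for example `python run_eval.py dtfit EleutherAI/pythia-14m step3000`.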

# Template
python run_eval.py {dataset} {model} {optional: revision}
  • dtfit

    python run_eval.py dtfit EleutherAI/pythia-14m
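
The command-line template above maps naturally onto two required positional arguments and one optional one. The following is only a sketch of such an interface using `argparse`. The actual argument handling in `run_eval.py` may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the template: dataset, model, optional revision."""
    parser = argparse.ArgumentParser(
        prog="run_eval.py",
        description="Evaluate a language model on a minimal-pairs dataset.",
    )
    parser.add_argument("dataset", help="dataset name, e.g. dtfit")
    parser.add_argument("model", help="model ID, e.g. EleutherAI/pythia-14m")
    # nargs='?' makes the positional optional; 'main' is the final checkpoint.
    parser.add_argument("revision", nargs="?", default="main",
                        help="branch name of a training checkpoint")
    return parser


args = build_parser().parse_args(["dtfit", "EleutherAI/pythia-14m"])
print(args.dataset, args.model, args.revision)
```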

ToDo

  • Performance
    • fix batch support
  • Optional
    • add support for commercial APIs as upper bound
    • extract & analyze contextual word embeddings
    • test other open models with checkpoints?
      • togethercomputer/RedPajama-INCITE-7B-Base
      • TinyLlama/TinyLlama-1.1B
      • Zyphra/Zamba-7b
      • Ablation Models?
        • checkpoints available for different common datasets for pretraining

Author

  • Maximilian Krupop
