Analysing The Impact of Sequence Composition on Language Model Pre-Training


This repository hosts the data and code for the paper Analysing The Impact of Sequence Composition on Language Model Pre-Training.

Data Processing

Download SlimPajama

bash ./scripts/download_slimpajama.sh

Split Data Based on Meta-Information

Decompress SlimPajama and split it into subsets according to the documents' meta-information:

export PYTHONPATH="./"
python ./preprocessing/split_to_subsets.py
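
For orientation, here is a minimal sketch of the split, assuming the standard SlimPajama layout in which each .jsonl.zst record names its source in meta["redpajama_set_name"]; ./preprocessing/split_to_subsets.py is the authoritative implementation:

import io
import json
import zstandard as zstd  # SlimPajama shards are .jsonl.zst files

def split_shard(shard_path, out_files):
    """Append every document in one shard to its subset's output file.

    out_files: dict mapping subset name -> open file handle.
    """
    with open(shard_path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            doc = json.loads(line)
            subset = doc["meta"]["redpajama_set_name"]  # e.g. "RedPajamaWikipedia"
            out_files[subset].write(json.dumps(doc) + "\n")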

Pre-Tokenize and Build Memmap Offset

export PYTHONPATH="./"
python ./preprocessing/create_corpus.py

We split each subset into several files; the number of splits per subset is defined by SUBSET_SPLIT_NUMS in project_config.py. Each file is saved as ./data/SlimPajama-150B/[subset_name]/[subset_name]_chunk[file_idx]_processed.jsonl.
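
As a rough sketch of what this step produces, assuming a flat binary token file with a companion offsets array (the tokenizer, dtype, and file names below are illustrative; see ./preprocessing/create_corpus.py for the real logic):

import json
import numpy as np
from transformers import AutoTokenizer  # tokenizer choice here is an assumption

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def build_memmap(jsonl_path, bin_path, offsets_path):
    """Write all token ids to one flat binary file plus an offsets index."""
    offsets = [0]
    with open(bin_path, "wb") as out:
        for line in open(jsonl_path, encoding="utf-8"):
            ids = tokenizer(json.loads(line)["text"])["input_ids"]
            np.asarray(ids, dtype=np.uint16).tofile(out)  # uint16 fits vocabularies < 65536
            offsets.append(offsets[-1] + len(ids))
    np.save(offsets_path, np.asarray(offsets, dtype=np.int64))

# Document i can then be sliced out lazily, without loading the corpus into RAM:
#   tokens = np.memmap(bin_path, dtype=np.uint16)[offsets[i]:offsets[i + 1]]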

Save Offline Dataset

MixChunk

python ./save_offline_dataset.py --packing_strategy=mixchunk

The resulting data is saved in ./data/offline_datasets/mixchunk.
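
Conceptually, MixChunk concatenates randomly shuffled documents from the mixed corpus and cuts the token stream into fixed-length chunks. A minimal sketch, with the EOS handling as an assumption (./save_offline_dataset.py is authoritative):

import random

def mixchunk(docs, chunk_len, eos_id):
    """docs: list of token-id lists drawn from all subsets."""
    random.shuffle(docs)
    stream = []
    for ids in docs:
        stream.extend(ids + [eos_id])  # mark document boundaries with EOS
    # Cut fixed-length chunks; the trailing partial chunk is dropped.
    return [stream[i:i + chunk_len]
            for i in range(0, len(stream) - chunk_len + 1, chunk_len)]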

UniChunk

python ./save_offline_dataset.py --packing_strategy=unichunk
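
A hedged reading of UniChunk, assuming each chunk holds tokens from only a single document, so no unrelated text shares a context window (long documents are split, short pieces padded); check the paper and ./save_offline_dataset.py for the exact definition:

def unichunk(docs, chunk_len, pad_id):
    """One document per chunk: split long documents, pad short pieces."""
    chunks = []
    for ids in docs:
        for i in range(0, len(ids), chunk_len):
            piece = ids[i:i + chunk_len]
            chunks.append(piece + [pad_id] * (chunk_len - len(piece)))
    return chunks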

BM25Chunk

BM25 retrieval is based on the Retriv library.

Build index:

python build_bm25_index.py

This builds a BM25 index for each file independently. Each index is saved in ./data/bm25index/collections/[subset_name]_[file_idx].
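
A hedged sketch of per-file indexing with Retriv's SparseRetriever (build_bm25_index.py is authoritative; the set_base_path call and the id scheme are assumptions):

import retriv
from retriv import SparseRetriever

retriv.set_base_path("./data/bm25index")  # indexes then land under ./data/bm25index/collections/

def build_index(subset_name, file_idx, texts):
    """texts: list of document strings from one processed file."""
    sr = SparseRetriever(index_name=f"{subset_name}_{file_idx}")
    sr.index([{"id": str(i), "text": t} for i, t in enumerate(texts)])
    return sr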

The retrieval strategy is implemented in retriv_bm25.py and retrieval_packing.py.
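
The gist of retrieval-based packing, sketched under assumptions (function and variable names are illustrative; the two files above are authoritative): grow a chunk by repeatedly retrieving a related, not-yet-used document, so documents sharing a context window are topically linked.

def bm25_chunk(seed_id, texts, tokens, retriever, chunk_len):
    """Pack one chunk starting from seed_id.

    texts/tokens: dicts mapping doc id -> raw text / token ids; ids match
    those used when building the Retriv index.
    """
    chunk, used, cur = [], set(), seed_id
    while cur is not None and len(chunk) < chunk_len:
        chunk.extend(tokens[cur])
        used.add(cur)
        hits = retriever.search(query=texts[cur], return_docs=True, cutoff=10)
        cur = next((h["id"] for h in hits if h["id"] not in used), None)
    return chunk[:chunk_len]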

Construct BM25Chunk on a single host:

python ./save_offline_dataset.py --packing_strategy=bm25chunk

Or construct BM25Chunk for each file by running:

python ./save_offline_dataset.py \
  --packing_strategy=bm25chunk \
  --bm25chunk_onefile \
  --subset_name=RedPajamaWikipedia \
  --file_idx=0

This example constructs BM25Chunk for a single file; these per-file construction tasks can be distributed across different CPU cores and hosts. Each subset_name and its total number of split files are defined in project_config.py. After constructing BM25Chunk for all files, combine the data by running:

python ./save_offline_dataset.py --packing_strategy=bm25chunk --combine_data

Evaluation

Download Datasets

python ./scripts/download_eval_data.py

Reading comprehension and retrieval-augmented generation:

cd ./evaluation
bash ./mrc.sh

Knowledge memorisation:

cd ./evaluation
bash ./cbqa.sh

In-context learning:

cd ./evaluation
bash ./icl.sh

Analysis

Burstiness

Calculate the Zipf's coefficient of the token frequency distribution: ./analysis/burstiness.py
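
A small sketch of the standard way to estimate this coefficient, fitting a line to the rank-frequency distribution in log-log space (./analysis/burstiness.py is authoritative):

from collections import Counter
import numpy as np

def zipf_coefficient(token_ids):
    """Fit log(frequency) = -s * log(rank) + c and return s."""
    counts = np.sort(np.asarray(list(Counter(token_ids).values())))[::-1]
    ranks = np.arange(1, len(counts) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), deg=1)
    # A larger s means frequency mass is more concentrated on few tokens.
    return -slope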

Distraction Proportion

Visualise the distraction proportion: ./analysis/distraction.py

Citing

@article{zhao2024analysing,
  title={Analysing The Impact of Sequence Composition on Language Model Pre-Training},
  author={Zhao, Yu and Qu, Yuanbin and Staniszewski, Konrad and Tworkowski, Szymon and Liu, Wei and Mi{\l}o{\'s}, Piotr and Wu, Yuxiang and Minervini, Pasquale},
  journal={arXiv preprint arXiv:2402.13991},
  year={2024}
}