lm-datasets

lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.

The documentation is available here.

Quick start

Installation

Install the lm-datasets package with pip:

pip install lm-datasets

To keep the package minimal by default, lm-datasets ships some functionality as optional dependencies. For example, to enable text extraction for all available datasets, run:

pip install "lm-datasets[datasets]"

Download and text extraction

To download a dataset and extract its plain text, run the following command:

lm_datasets extract_text $DATASET_ID $OUTPUT_DIR

By default, output is saved as JSONL files. To change the output format, use the --output_format argument as shown below:

lm_datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
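
Once extraction has finished, the default JSONL output can be read back with the Python standard library alone. The following is a minimal sketch and not part of lm-datasets: the function name iter_texts is hypothetical, and it assumes that each line is one JSON object with the document text stored under a "text" key. Check the field names in your own output files before relying on it.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(jsonl_path: Path, text_field: str = "text") -> Iterator[str]:
    """Yield the text of every record in a JSONL file.

    Assumption: one JSON object per line, with the plain text stored under
    `text_field`. The default key "text" is a guess, not a documented name.
    """
    with jsonl_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            if text_field in record:  # ignore records without a text field
                yield record[text_field]
```

The generator reads one line at a time, so it can be used on large output files without loading them into memory.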

Available datasets

A list or table of all available datasets can be printed with the following command:

lm_datasets print_stats --print_output md

Token count by language

Language Tokens
bg 53 B
ca 5 B
code 250 B
cs 128 B
da 34 B
de 795 B
el 108 B
en 6 T
es 674 B
et 15 B
eu 696 M
fi 55 B
fr 655 B
ga 767 M
gl 70 M
hr 8 B
hu 179 B
it 386 B
lt 24 B
lv 14 B
mt 4 B
nl 238 B
nn 307 M
no 9 B
pl 223 B
pt 187 B
ro 77 B
sh 2 M
sk 47 B
sl 11 B
sr 10 B
sv 89 B
uk 47 B
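
The counts in these tables are rounded and use the suffixes k, M, B, and T. To compare them numerically they first have to be converted to integers. The sketch below shows one way to do that; it is not part of lm-datasets, the helper name parse_token_count is hypothetical, and it assumes the suffixes stand for thousand, million, billion, and trillion.

```python
# Assumed multipliers for the suffixes used in the token tables.
SUFFIXES = {"k": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_token_count(value: str) -> int:
    """Convert a rounded count such as "53 B" or "696 M" to an integer."""
    parts = value.split()
    if len(parts) == 1:  # plain number without a suffix, e.g. "0"
        return int(parts[0])
    number, suffix = parts
    return int(number) * SUFFIXES[suffix]


# A few rows copied from the language table.
rows = {"de": "795 B", "en": "6 T", "eu": "696 M"}
counts = {lang: parse_token_count(v) for lang, v in rows.items()}

# Ratio of English to German tokens; approximate, since the inputs are rounded.
ratio = counts["en"] / counts["de"]
```

Because the table values are rounded to whole units, any ratio or sum computed this way is only an estimate.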

Token count by source

Source Tokens
academic_slovene_kas 1 B
bgnc_admin_eur 79 M
bgnc_news_corpus 18 M
brwac 3 B
bulgarian_news 283 M
bulnc 567 M
cabernet 712 M
cc_gigafida 127 M
colossal_oscar 208 B
croatian_news_engri 695 M
curlicat 410 M
danewsroom 472 M
danish_gigaword 1 B
dewac 2 B
dialogstudio 0
dk_clarin 441 M
enc2021 0
estonian_reference_corpus 175 M
eurlex 121 B
euscrawl 423 M
ga_bilingual_legistation 4 M
ga_universal_dependencies 3 M
greek_legal_code 45 M
greek_web_corpus 3 B
hrwac 1 B
itwac 2 B
korpus_malti 366 M
legal_mc4 29 B
macocu 23 B
marcell_legislative_subcorpus_v2 31 M
norwegian_cc 5 B
openlegaldata 10 B
oscar 9 T
oscar_opengptx 245 B
parlamento_pt 819 M
pes2o 42 B
pl_nkjp 1 M
pl_parliamentary_corpus 671 M
proof_pile 8 B
redpajama 46 B
seimas_lt_en 48 k
sk_court_decisions 11 B
sk_laws 45 M
slwac_web 1 B
sonar 500 M
sonar_new_media 36 M
spanish_legal 3 B
srpkor 0
starcoder 250 B
state_related_latvian_web 1 M
styria_news 409 M
sv_gigaword 1 B
syn_v9 5 B
uk_laws 579 M
wiki 12 B
wikibooks 353 M
wikihow 2 M
wikinews 79 M
wikiquote 268 M
wikisource 2 B
wikivoyage 132 M
ylenews 0

Dataset viewer

We provide a web-based application, built with Streamlit, for browsing all datasets and their text content. To start the app, clone this repository, install the dependencies, and run the following command:

# clone is needed since streamlit does not support apps from modules yet
git clone https://github.com/malteos/lm-datasets.git

streamlit run src/lm_datasets/viewer/app.py -- \
    --raw_datasets_dir=$RAW_DATASETS_DIR \
    --output_dir=$PROCESSED_DATASET_DIR

Development & Contributions

Setup environment

To set up your local development environment, we recommend cloning the repository and using conda. The repository also includes settings and launch scripts for VSCode.

git clone git@github.com:malteos/lm-datasets.git
cd lm-datasets

conda create -n lm-datasets python=3.10
conda activate lm-datasets

pip install -r requirements.txt

Alternatively, you can install the Python package directly from the dev branch:

pip install git+https://github.com/malteos/lm-datasets.git@dev

Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.

pre-commit install
git config --bool flake8.strict true  # Makes the commit fail if flake8 reports an error

To run the hooks:

pre-commit run --all-files

Testing

The tests can be executed with:

pytest --doctest-modules --cov-report term --cov=lm_datasets
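
Because the command passes --doctest-modules, pytest also collects the examples embedded in docstrings and runs them as tests. As an illustration only, a function like the hypothetical helper below (it is not part of lm-datasets) would have its >>> example executed and checked against the expected output:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces.

    Hypothetical helper, used only to illustrate doctest collection.

    >>> normalize_whitespace("  hello   world ")
    'hello world'
    """
    # str.split() without arguments splits on any whitespace run and
    # drops leading/trailing whitespace.
    return " ".join(text.split())
```

Keeping a short example in the docstring documents the function and guards its behavior at the same time.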

Acknowledgements

The work on the lm-datasets software is partially funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) through the project OpenGPT-X (project no. 68GX21007D).

License

Apache 2.0

(Please note that the datasets themselves are released under different licenses.)