Dedup and postprocessing for text datasets gathered from https://github.com/users/huseinzol05/projects/1
All deduped and postprocessed datasets are uploaded at https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset
- 24 cores.
- 220 GB RAM.
Deduping can explode the memory; it can easily eat up to 30 GB of RAM if the dataset is larger than 10 GB, so beware.
- Most of the download files are straightforward, e.g.,
wget https://huggingface.co/datasets/mesolitica/crawl-amanz-my/resolve/main/parsed.jsonl -O hf-datasets/raw-datasets/amanz.jsonl
But sometimes we have to do some preprocessing first (see the sketch below).
We save raw datasets at hf-datasets/raw-datasets.
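As an illustration only, a typical light preprocessing step is flattening a downloaded file into JSONL (one JSON object per line) before deduping. This is a minimal sketch, assuming a hypothetical file name and field (`example-raw.json`, `body`); it is not tied to any specific dataset in the list.

```python
import json

# Hypothetical example: a downloaded file contains one big JSON array,
# but the dedup notebooks expect JSONL (one JSON object per line).
with open('hf-datasets/raw-datasets/example-raw.json') as fopen:
    rows = json.load(fopen)

with open('hf-datasets/raw-datasets/example.jsonl', 'w') as fopen:
    for row in rows:
        # keep only the text field, mirroring the other raw datasets
        fopen.write(json.dumps({'text': row['body']}) + '\n')
```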
- Clone remove-duplicate-text-dataset.ipynb to a new notebook, e.g., remove-duplicate-text-dataset-lowyat.ipynb.
This notebook uses text_dedup, borrowed from https://github.com/ChenghaoMou/text-dedup, to do the deduplication (a rough sketch of the idea is shown below).
All deduped datasets are saved at hf-datasets/dedupe-datasets.
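The notebooks rely on text_dedup itself; the snippet below is only a rough sketch of the MinHash/LSH idea behind near-deduplication, written with the separate datasketch library and hypothetical parameters, not the exact pipeline used in the notebooks.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # build a MinHash signature from whitespace tokens
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode('utf-8'))
    return m

texts = [
    'berita teknologi terkini di malaysia',
    'berita teknologi terkini di malaysia hari ini',  # near-duplicate of the first
    'resepi nasi lemak paling sedap',
]

# hypothetical similarity threshold
lsh = MinHashLSH(threshold=0.5, num_perm=128)
unique = []
for i, text in enumerate(texts):
    m = minhash_of(text)
    if lsh.query(m):
        continue  # a similar text is already kept, skip this one
    lsh.insert(str(i), m)
    unique.append(text)

print(unique)
```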
- Run postprocessing.ipynb to start postprocessing, which will:
  - remove texts that contain HTTP errors.
  - remove texts shorter than 3 characters.
  - replace runs of 6 or more spaces with 6 spaces.
  - replace runs of 6 or more dots with 6 dots.
  A rough sketch of these rules is shown below.
  Rerunning this notebook will not overwrite already postprocessed datasets.
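A minimal sketch of the postprocessing rules above, assuming each record carries a plain text string; the HTTP-error check here is a simple hypothetical substring test, not necessarily the exact filter used in the notebook.

```python
import re

def postprocess(text):
    # drop texts that look like HTTP error pages (hypothetical check)
    if '404 Not Found' in text or '403 Forbidden' in text:
        return None
    # drop texts shorter than 3 characters
    if len(text) < 3:
        return None
    # collapse runs of 6 or more spaces into exactly 6 spaces
    text = re.sub(r' {6,}', ' ' * 6, text)
    # collapse runs of 6 or more dots into exactly 6 dots
    text = re.sub(r'\.{6,}', '.' * 6, text)
    return text

assert postprocess('hi') is None
assert postprocess('a' + ' ' * 10 + 'b') == 'a' + ' ' * 6 + 'b'
```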
There is no consideration of AI alignment and safety in the current dataset; we only apply a basic postfilter.
Released as a Python library: https://github.com/malaysia-ai/clean_text_my