Dedup and postprocessing for text datasets gathered from https://github.com/users/huseinzol05/projects/1
All deduped and postprocessed datasets are uploaded at https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset
- 24 cores.
- 220 GB RAM.
Deduping can explode the memory; it can easily eat up to 30 GB of RAM if the dataset is larger than 10 GB, so beware.
- Most of the download files are straightforward, e.g.,
wget https://huggingface.co/datasets/mesolitica/crawl-amanz-my/resolve/main/parsed.jsonl -O hf-datasets/raw-datasets/amanz.jsonl
But sometimes we have to do some preprocessing first (see the sketch below).
We save raw datasets at hf-datasets/raw-datasets.
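As an illustration only, a typical light preprocessing step is flattening a downloaded file into JSONL (one JSON object per line) before deduping. This is a minimal sketch, assuming a hypothetical file name and field (`example-raw.json`, `body`); it is not tied to any specific dataset in the list.

```python
import json

# Hypothetical example: a downloaded file contains one big JSON array,
# but the dedup notebooks expect JSONL (one JSON object per line).
with open('hf-datasets/raw-datasets/example-raw.json') as fopen:
    rows = json.load(fopen)

with open('hf-datasets/raw-datasets/example.jsonl', 'w') as fopen:
    for row in rows:
        # keep only the text field, mirroring the other raw datasets
        fopen.write(json.dumps({'text': row['body']}) + '\n')
```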
- Clone remove-duplicate-text-dataset.ipynb to a new notebook, e.g., remove-duplicate-text-dataset-lowyat.ipynb.
This notebook uses text_dedup, borrowed from https://github.com/ChenghaoMou/text-dedup, to do the deduplication (a rough sketch of the idea is shown below).
All deduped datasets are saved at hf-datasets/dedupe-datasets.
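The notebooks rely on text_dedup itself; the snippet below is only a rough sketch of the MinHash/LSH idea behind near-deduplication, written with the separate datasketch library and hypothetical parameters, not the exact pipeline used in the notebooks.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # build a MinHash signature from whitespace tokens
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode('utf-8'))
    return m

texts = [
    'berita teknologi terkini di malaysia',
    'berita teknologi terkini di malaysia hari ini',  # near-duplicate of the first
    'resepi nasi lemak paling sedap',
]

# hypothetical similarity threshold
lsh = MinHashLSH(threshold=0.5, num_perm=128)
unique = []
for i, text in enumerate(texts):
    m = minhash_of(text)
    if lsh.query(m):
        continue  # a similar text is already kept, skip this one
    lsh.insert(str(i), m)
    unique.append(text)

print(unique)
```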
- Run postprocessing.ipynb to start postprocessing, which will:
  - remove texts that contain HTTP errors.
  - remove texts shorter than 3 characters.
  - replace runs of 6 or more spaces with 6 spaces.
  - replace runs of 6 or more dots with 6 dots.
  A rough sketch of these rules is shown below.
  Rerunning this notebook will not overwrite already postprocessed datasets.
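A minimal sketch of the postprocessing rules above, assuming each record carries a plain text string; the HTTP-error check here is a simple hypothetical substring test, not necessarily the exact filter used in the notebook.

```python
import re

def postprocess(text):
    # drop texts that look like HTTP error pages (hypothetical check)
    if '404 Not Found' in text or '403 Forbidden' in text:
        return None
    # drop texts shorter than 3 characters
    if len(text) < 3:
        return None
    # collapse runs of 6 or more spaces into exactly 6 spaces
    text = re.sub(r' {6,}', ' ' * 6, text)
    # collapse runs of 6 or more dots into exactly 6 dots
    text = re.sub(r'\.{6,}', '.' * 6, text)
    return text

assert postprocess('hi') is None
assert postprocess('a' + ' ' * 10 + 'b') == 'a' + ' ' * 6 + 'b'
```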
There is no consideration of AI alignment and safety in the current dataset; we only apply a basic postfilter.
Released as a Python library: https://github.com/malaysia-ai/clean_text_my