SailCraft: Data Toolkit for Sailor Language Models

This repository provides a data processing pipeline for large language model training. It consists of four stages: initial data cleaning, near deduplication, exact deduplication, and a second round of data cleaning. The data cleaning stage is optimized in particular for South-East Asian languages (e.g., Thai).

Requirements

Install the packages and download the models for data cleaning. Here we only download the models for English, Chinese, Thai, Vietnamese, Indonesian, Malay, and Lao. You can add more languages by modifying the --used_language_ids parameter. The full language list can be found here.

pip install -r requirements.txt
mkdir lm_resource
python code/data_cleaning/download_sentencepiece_kenlm_models.py --used_language_ids en zh th vi id ms lo --output_dir_path lm_resource

Install Rust, which is required for exact deduplication. Refer to this guidance for more details.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Quickstart

We sample 1,000 lines from the cc100 Indonesian subset for a preliminary analysis.

Execute the script by running:

bash run_example.sh

Upon successful execution, you should observe the following logs indicating the processing stages:

Counting lines in cleaned data output: 987
Counting lines in near deduplication output: 974
Counting lines in exact deduplication output: 963
Counting lines in final output: 949

This output confirms the sequential filtering and deduplication stages of the dataset. The final output can be accessed at data/data_output/final_output/sample/data_clean.jsonl.
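To see how much each stage removes, the counts in the log can be turned into per-stage deltas. The following is a minimal sketch, assuming the input sample holds exactly 1,000 lines as stated above; the `stage_report` helper is hypothetical and not part of the repository.

```python
# Line counts copied from the quickstart log; 1000 is the stated sample size.
stage_counts = [
    ("input sample", 1000),
    ("cleaned data", 987),
    ("near deduplication", 974),
    ("exact deduplication", 963),
    ("final output", 949),
]


def stage_report(counts):
    """Return (stage, removed_lines, removed_percent) relative to the previous stage."""
    report = []
    for (_, prev), (name, cur) in zip(counts, counts[1:]):
        removed = prev - cur
        report.append((name, removed, round(100.0 * removed / prev, 2)))
    return report


report = stage_report(stage_counts)
for name, removed, pct in report:
    print(f"{name}: removed {removed} lines ({pct}% of the previous stage)")

total_removed = stage_counts[0][1] - stage_counts[-1][1]
print(f"overall: kept {stage_counts[-1][1]} of {stage_counts[0][1]} lines")
```

On this sample every stage removes between 11 and 14 lines, for 51 lines removed overall.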

Running with Your Own Dataset

To integrate your own dataset into the project, follow these steps:

Prepare Your Dataset: Place your dataset file, named ALIAS.jsonl, in the ./data/data_input/ directory.
Configure Script Variables: Adjust the ALIAS and LANGUAGE variables in the ./run_example.sh script to correspond with your dataset details.
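If your corpus is not yet in JSON Lines form, a small converter along the following lines may help. This is only a sketch: the `write_jsonl` helper is hypothetical, and the `"text"` field name is an assumption that should be checked against the pipeline's reader before use.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(texts, alias, input_dir="data/data_input", text_key="text"):
    """Write one JSON object per line to <input_dir>/<alias>.jsonl.

    text_key="text" is an assumption, not a documented field name.
    """
    out_dir = Path(input_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{alias}.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps non-Latin scripts (e.g., Thai) readable in the file.
            f.write(json.dumps({text_key: text}, ensure_ascii=False) + "\n")
    return out_path


# Demo in a temporary directory so that nothing in the repository is overwritten.
demo_dir = Path(tempfile.mkdtemp())
path = write_jsonl(["Halo dunia.", "สวัสดี"], alias="my_corpus", input_dir=demo_dir / "data_input")
print(path)
```

Inside the repository you would keep the default `input_dir` and set ALIAS in run_example.sh to the same alias (here the illustrative `my_corpus`).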

Parameter Settings

Ensure proper configuration of the processes by setting the following parameters:

Data Cleaning: Set the parameters for each filter. Detailed configuration can be found here.
Near Deduplication: Specify the number of permutations to use in MinHash by referring to the example here.
Exact Deduplication: Set the length threshold for the duplicated substrings to identify, as shown in the example here.
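To build intuition for the number-of-permutations setting, the sketch below implements a toy MinHash in pure Python and compares its Jaccard estimate with the exact value. It is not the repository's implementation: seeded BLAKE2b hashes stand in for permutations, and the function names and the 5-character shingle size are illustrative assumptions.

```python
import hashlib


def shingles(text, n=5):
    """Set of character n-grams of the text (n=5 is an illustrative choice)."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(items, num_perm):
    """Keep the minimum hash per seeded hash function; each seed stands in for one permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


doc_a = "tidak pernah disentuh oleh manusia sebelum mereka dan tidak pula oleh jin"
doc_b = "tidak pernah disentuh oleh manusia sebelum mereka, dan tidak pula oleh jin."
set_a, set_b = shingles(doc_a), shingles(doc_b)

print("exact Jaccard:", round(exact_jaccard(set_a, set_b), 3))
for num_perm in (16, 256):
    est = estimated_jaccard(minhash_signature(set_a, num_perm), minhash_signature(set_b, num_perm))
    print(f"num_perm={num_perm}: estimated Jaccard = {est:.3f}")
```

More permutations give a lower-variance estimate at the cost of more hashing per document, which is the trade-off to keep in mind when choosing the value.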

Case Studies

For data cleaning, check the logs in code/data_cleaning/filtering_logs to see what each filter removed.
Run code/exact_dedup/scripts/count_topk_occurrences.py to obtain the top-k occurrences.

python code/exact_dedup/scripts/count_topk_occurrences.py \
--data_alias sample \
--split train \
--top_k_number 100 \
--threshold 2 \
--cache_dir cache/exact_dedup_cache

This script displays the top 100 most frequent text spans that occur more than twice in the dataset.

Count	Span
4	'pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan'
4	'k pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) da'
4	'nah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tid'
4	'sentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pul'
4	'uh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pula ol'
4	'ah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tida'
4	'ak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) d'
4	'ernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan t'
3	'manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pula oleh jin.'
3	'tidak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka'
3	'tidak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami merek'
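The spans in the table overlap heavily, which is what one would expect if a single repeated passage is reported as many fixed-length windows, each shifted by a character or two. The sketch below reproduces that effect with a naive in-memory counter. It is an assumption-laden illustration, not the repository's Rust-based implementation: `count_repeated_spans` is a hypothetical helper, the 30-character window is arbitrary, and the strict `count > threshold` comparison follows the "more than twice" wording above.

```python
from collections import Counter


def count_repeated_spans(documents, span_length, threshold, top_k):
    """Count every span_length-character window and keep those seen more than threshold times."""
    counts = Counter()
    for doc in documents:
        for start in range(len(doc) - span_length + 1):
            counts[doc[start:start + span_length]] += 1
    repeated = [(span, c) for span, c in counts.most_common() if c > threshold]
    return repeated[:top_k]


# One passage embedded in three otherwise different documents (toy data).
passage = "dan tidak pula oleh jin sebelum mereka"
documents = [
    "A1 " + passage + " x1",
    "B2: " + passage + " y2!",
    "C3 - " + passage + " z3?",
    "unrelated short line",
]

top_spans = count_repeated_spans(documents, span_length=30, threshold=2, top_k=5)
print("Count\tSpan")
for span, count in top_spans:
    print(f"{count}\t{span!r}")
```

Every reported window comes from the same passage, so a long list of near-identical spans usually points to one duplicated region rather than many independent ones.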

Acknowledgment

Thanks to the contributors of the following projects:

Citing this work

If you use this repository or the Sailor models, please cite:

@misc{dou2024sailor,
      title={Sailor: Open Language Models for South-East Asia}, 
      author={Longxu Dou and Qian Liu and Guangtao Zeng and Jia Guo and Jiahui Zhou and Wei Lu and Min Lin},
      year={2024},
      eprint={2404.03608},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

If you have any questions, please raise an issue on our GitHub repository or contact doulx@sea.com.
