How to create training data through pipeline
b3y0nd opened this issue · 4 comments
I want to train the NLLB model. As instructed by the data README documentation, I have tried the filtering pipeline and got the output of populate_data_conf.py and compute_length_factors.py. But I don't know how to run the prepare_data pipeline, especially the three parameters required by prepare_data.py, such as the YAML file required by the --data-config parameter. Could you provide an example? Thanks a lot.
In addition, what is the relationship between the filtering pipeline and the prepare_data pipeline? The latter doesn't seem to use the output of the former.
The compute_length_factors.py used in the filtering pipeline doesn't seem to be updated, as it requires the flores101 dataset instead of flores200.
The README for prepare_data answers this (https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/prepare_data). Here's an example config:
binarization_config:
  binarize_workers: 60
  max_examples_per_shard: 500000000
  random_seed: 0
  smallest_shard: 500000
executor_config:
  cluster: local
  log_folder: executor_logs
preprocessing_config:
  max_tokens: null
  moses_config:
    deescape_special_chars: false
    lowercase: false
    normalize_punctuation: true
    remove_non_printing_chars: false
    script_directory: <PATH_TO_FAIRSEQ_DIR>/fairseq-py/examples/nllb/modeling/preprocessing/moses
  preprocess_source: true
  preprocess_target: true
  sample_size: null
  tag_data: true
source_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
target_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
test_corpora:
  eng-ibo:
    values:
      flores_devtest:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_corpora:
  eng-ibo:
    values:
      public_bitext:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_mining_corpora: null
train_mmt_bt_corpora: null
train_smt_bt_corpora: null
valid_corpora:
  eng-ibo:
    values:
      flores_dev:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
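If it helps, here is a minimal sketch (not part of the pipeline; it just assumes you saved the config above as data_config.yaml with the <PATH> placeholders filled in) that loads the YAML and checks the corpus paths before launching prepare_data:

import sys
from pathlib import Path

import yaml  # pyyaml

# Load the example prepare_data config shown above.
with open("data_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Walk the corpora sections and make sure every source/target file exists.
missing = []
for section in ("train_corpora", "valid_corpora", "test_corpora"):
    corpora = cfg.get(section) or {}
    for direction, entry in corpora.items():
        for corpus_name, corpus in entry["values"].items():
            for side in ("source", "target"):
                path = Path(corpus[side])
                if not path.exists():
                    missing.append(f"{section}/{direction}/{corpus_name}/{side}: {path}")

if missing:
    print("Missing corpus files:")
    print("\n".join(missing))
    sys.exit(1)
print("All corpus paths found.")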
The data_path format is detailed in the README; you need to organize your corpus files in a specific way for them to be read.
Make sure you download the Moses scripts into your fairseq directory. The pipeline runs only if this script exists:
examples/nllb/modeling/preprocessing/moses/clean-corpus-n.perl
(https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/prepare_data/encode_and_binarize.py#L78)
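As a quick pre-flight check (a sketch only, reusing the same assumed data_config.yaml; it reads script_directory from the config above and verifies the required cleaning script is in place):

from pathlib import Path

import yaml  # pyyaml

with open("data_config.yaml") as f:
    cfg = yaml.safe_load(f)

# script_directory comes from preprocessing_config.moses_config in the example config.
moses_dir = Path(cfg["preprocessing_config"]["moses_config"]["script_directory"])

# encode_and_binarize.py expects this script to be present.
clean_script = moses_dir / "clean-corpus-n.perl"
if not clean_script.is_file():
    raise FileNotFoundError(
        f"Missing {clean_script}; download the Moses scripts into your fairseq checkout first."
    )
print(f"Found Moses cleaning script at {clean_script}")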
You need to run the filtering pipeline to filter out data based on the following heuristics: length, deduplication, LASER margin score threshold, LID score thresholds, and toxicity. It's not sufficient to just get the output of populate_data_conf.py and compute_length_factors.py; the output of these scripts is then passed into the filtering pipeline. This is detailed in the README here: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/filtering
After filtering, you build a vocabulary (SentencePiece model) and then encode and binarize your data, which can then be fed into fairseq for training. The filtered datasets (the output of the filtering pipeline) should be fed into the prepare_data pipeline. You're right, we should have the filtering pipeline output the data_config for prepare_data. We are working on such changes to refactor both of these pipelines and will push out a change soon to address that.
Thank you very much for your answer. I read another issue (#15), and I still have the same question: how does the prepare_data pipeline use the output of the filtering pipeline? Judging from the parameters used to run the prepare_data pipeline, the two seem to be unrelated.
Apologies, you are correct. Currently the filtering pipeline doesn't output the input config of the prepare_data pipeline, which is inconvenient for users. We're working on completely refactoring the two pipelines to be well integrated with Stopes, as well as having filtering produce the input config of prepare_data. I'm sorry about that. In the meantime, you can look at the prepare_data input config format in its README and write a short script to create it, based on the filtered src/tgt files for all directions x data sources.
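For example, here is a rough sketch of such a script. The directory layout it assumes (filtered files written as filtered_data/<src>-<tgt>/<corpus>.<lang>) is purely illustrative; adapt the globbing to however your filtering run actually named its outputs, and fill in valid/test corpora the same way.

from pathlib import Path

import yaml  # pyyaml

FILTERED_ROOT = Path("filtered_data")  # hypothetical root of the filtering pipeline output
OUT_CONFIG = Path("data_config.yaml")


def corpus_entry(src_file: Path, tgt_file: Path) -> dict:
    # Mirrors the per-corpus fields from the example prepare_data config above.
    return {
        "data_tag": None,
        "is_gzip": False,
        "num_lines": None,
        "source": str(src_file),
        "target": str(tgt_file),
    }


train_corpora = {}
# Assumed layout: filtered_data/<src>-<tgt>/<corpus>.<src_lang> and <corpus>.<tgt_lang>
for direction_dir in sorted(FILTERED_ROOT.iterdir()):
    if not direction_dir.is_dir() or "-" not in direction_dir.name:
        continue
    src_lang, tgt_lang = direction_dir.name.split("-", 1)
    values = {}
    for src_file in sorted(direction_dir.glob(f"*.{src_lang}")):
        corpus_name = src_file.name[: -len(src_lang) - 1]
        tgt_file = direction_dir / f"{corpus_name}.{tgt_lang}"
        if tgt_file.exists():
            values[corpus_name] = corpus_entry(src_file, tgt_file)
    if values:
        train_corpora[direction_dir.name] = {"values": values}

# valid_corpora / test_corpora (e.g. flores_dev, flores_devtest) would be built the same way.
data_config = {"train_corpora": train_corpora}
OUT_CONFIG.write_text(yaml.safe_dump(data_config))
print(f"Wrote {OUT_CONFIG} with {len(train_corpora)} directions.")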