Filtering pipeline produces a config with wrong lang directions
molokanov50 opened this issue · 1 comments
I want to fine-tune an NLLB model on my own data, so as I see it the task is relatively simple: convert my dataset to fairseq format. I therefore started using the stopes pipelines. However, although the directory structure of my dataset implies the eng_Latn-rus_Cyrl language direction, the config.yaml produced by the filtering pipeline lists entirely different language pairs.
My dataset consists of 2 files (FTData is the root directory of my dataset):
FTData/eng_Latn-rus_Cyrl/mycorpus.eng_Latn.gz
FTData/eng_Latn-rus_Cyrl/mycorpus.rus_Cyrl.gz
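To make the expectation explicit, here is a minimal sketch that derives the direction and corpus names from a layout like the one above. It assumes the convention `<root>/<src>-<tgt>/<corpus>.<lang>.gz` suggested by the two paths; `infer_directions` is a hypothetical helper written for illustration, not part of stopes, and stopes may apply different rules.

```python
from pathlib import Path


def infer_directions(root):
    """Map each '<src>-<tgt>' directory under root to the corpora that
    have a .gz file for BOTH languages of that direction.

    Assumed layout (not verified against stopes):
        <root>/<src>-<tgt>/<corpus>.<lang>.gz
    """
    found = {}
    for direction_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        src, sep, tgt = direction_dir.name.partition("-")
        if not sep:
            continue  # directory name is not of the form src-tgt
        langs_by_corpus = {}
        for f in direction_dir.glob("*.gz"):
            # "mycorpus.eng_Latn.gz" -> corpus "mycorpus", lang "eng_Latn"
            stem = f.name[: -len(".gz")]
            corpus, dot, lang = stem.rpartition(".")
            if dot:
                langs_by_corpus.setdefault(corpus, set()).add(lang)
        # keep only corpora where both sides of the pair are present
        found[direction_dir.name] = sorted(
            c for c, langs in langs_by_corpus.items() if {src, tgt} <= langs
        )
    return found
```

For the two files listed above, `infer_directions("FTData")` would give `{"eng_Latn-rus_Cyrl": ["mycorpus"]}`, which is the direction I expected to see in the generated config.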
Then I run:
python stopes/stopes/pipelines/filtering/scripts/populate_data_conf.py --bt-root bt --mined-data-root mined --primary-train-paths FTData --data-conf-dir ConfOutput train_primary
where bt and mined are empty directories (since I initially have only my own texts, without any preprocessing). Then:
python stopes/stopes/pipelines/filtering/scripts/compute_length_factors.py --data-conf-dir ConfOutput --flores-path flores
where flores is also an empty directory (I don't need any external corpora, since my goal is to fine-tune only on my data, but --flores-path is a required parameter of compute_length_factors.py, so I assumed I could point it at an arbitrary directory). And lastly:
python stopes/stopes/pipelines/filtering/filter.py output_dir=FTFiltered data_conf_dir=ConfOutput
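For reproducibility, the three steps can be sketched as one small driver that stops at the first failing step. The script paths and flags are copied verbatim from the commands above and are not checked against the stopes source; `build_commands` and `run_pipeline` are hypothetical names used only for this illustration.

```python
import subprocess
import sys

# Path prefix as used in the commands above (relative to the working directory).
SCRIPTS = "stopes/stopes/pipelines/filtering"


def build_commands(train_root, conf_dir, output_dir,
                   bt_root="bt", mined_root="mined", flores_path="flores"):
    """Return the three command lines in the order they were run."""
    py = sys.executable
    return [
        [py, f"{SCRIPTS}/scripts/populate_data_conf.py",
         "--bt-root", bt_root,
         "--mined-data-root", mined_root,
         "--primary-train-paths", train_root,
         "--data-conf-dir", conf_dir,
         "train_primary"],
        [py, f"{SCRIPTS}/scripts/compute_length_factors.py",
         "--data-conf-dir", conf_dir,
         "--flores-path", flores_path],
        [py, f"{SCRIPTS}/filter.py",
         f"output_dir={output_dir}",
         f"data_conf_dir={conf_dir}"],
    ]


def run_pipeline(commands):
    """Run each step in turn; check=True raises on a non-zero exit code."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

With this, `run_pipeline(build_commands("FTData", "ConfOutput", "FTFiltered"))` repeats exactly the sequence described above.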
My FTFiltered/config.yaml file looks as follows:
data_conf_dir: /home/molokanov/myapp3/ConfOutput
directions:
- eng_Latn-lij_Latn
- eng_Latn-scn_Latn
executor:
  cluster: local
  log_folder: executor_logs
  slurm_partition: null
output_dir: /home/molokanov/myapp3/FTFiltered
train_bt: null
train_mined: null
train_primary:
  dedup_filter:
    _target_: stopes.pipelines.filtering.filters.DedupFilter
    dedup_pairs: true
    max_source_dedup: null
    max_target_dedup: null
  excluded_corpora: null
  included_corpora:
  - nllbseed
  - tatoeba
  laser_filter: null
  length_filter:
    _target_: stopes.pipelines.filtering.filters.LengthFilter
    max_len: 1050
    max_len_ratio: 9.0
    min_len: 5
    min_src_unique_ratio: null
  lid_filter: null
  normalize_punctuation: true
  normalize_unicode: false
  toxicity_filter: null
As you can see, eng_Latn-lij_Latn and eng_Latn-scn_Latn are not in my dataset, yet they appear in the config. At the same time, eng_Latn-rus_Cyrl, the language pair I actually need, is missing from it.
Also, I don't understand why nllbseed and tatoeba are listed as included corpora in my config.yaml.
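The mismatch can be checked mechanically. The sketch below reads the `directions:` list out of the config text and compares it with the directions the dataset is supposed to provide. `read_directions` is a deliberately naive line-based reader written for this illustration, not a YAML parser and not part of stopes; a real check would load the file with a proper YAML library.

```python
def read_directions(config_text):
    """Collect the '- item' lines that directly follow 'directions:'."""
    directions = []
    in_block = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == "directions:":
            in_block = True
            continue
        if in_block:
            if line.startswith("- "):
                directions.append(line[2:].strip())
            elif line:
                break  # first non-list line ends the block
    return directions


def compare(expected, configured):
    """Report directions that are missing from, or unexpected in, the config."""
    expected, configured = set(expected), set(configured)
    return {
        "missing": sorted(expected - configured),
        "unexpected": sorted(configured - expected),
    }


# Excerpt of the generated config shown above.
SAMPLE = """\
data_conf_dir: /home/molokanov/myapp3/ConfOutput
directions:
- eng_Latn-lij_Latn
- eng_Latn-scn_Latn
executor:
"""

report = compare(["eng_Latn-rus_Cyrl"], read_directions(SAMPLE))
# report["missing"]    == ["eng_Latn-rus_Cyrl"]
# report["unexpected"] == ["eng_Latn-lij_Latn", "eng_Latn-scn_Latn"]
```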
- Your config is not right: remove those 2 directions and add eng_Latn-rus_Cyrl (or, more likely I think, only eng-rus will work). Do the same for the opposite direction if you need it.
- You need to add your corpora as well.
- Download the flores dataset and use it to prepare those 2 configs. I ended up with something like the below: