Filtering pipeline produces a config with wrong lang directions

Question

molokanov50 opened this issue 2 years ago · 1 comments

I want to fine-tune an NLLB model on my own data, so as I see it the task is relatively simple: convert my dataset to fairseq format. I started using the stopes pipelines for this. However, although the directory structure of my dataset implies the eng_Latn-rus_Cyrl language direction, the config.yaml produced by the filtering pipeline lists completely different language pairs.
My dataset consists of 2 files (FTData is the root directory of my dataset):
FTData/eng_Latn-rus_Cyrl/mycorpus.eng_Latn.gz,
FTData/eng_Latn-rus_Cyrl/mycorpus.rus_Cyrl.gz.
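As a sanity check on the layout above, here is a minimal Python sketch that lists which directions and corpus names the directory tree implies. It is not how stopes discovers directions; the function name `implied_directions` is hypothetical, and the `<root>/<src>-<tgt>/<corpus>.<lang>.gz` naming is assumed from the two paths in the question.

```python
from pathlib import Path


def implied_directions(root: str) -> dict[str, list[str]]:
    """Map each <src>-<tgt> directory under root to the corpora found in it.

    Assumed layout (taken from the question, not from stopes itself):
    <root>/<src>-<tgt>/<corpus>.<src>.gz and <root>/<src>-<tgt>/<corpus>.<tgt>.gz
    """
    found: dict[str, list[str]] = {}
    for direction_dir in sorted(Path(root).iterdir()):
        if not direction_dir.is_dir() or "-" not in direction_dir.name:
            continue
        src, tgt = direction_dir.name.split("-", 1)
        # Strip the ".<lang>.gz" suffix to recover the corpus name.
        src_corpora = {
            p.name[: -len(f".{src}.gz")] for p in direction_dir.glob(f"*.{src}.gz")
        }
        tgt_corpora = {
            p.name[: -len(f".{tgt}.gz")] for p in direction_dir.glob(f"*.{tgt}.gz")
        }
        # Only corpora with both sides present count as a usable pair.
        found[direction_dir.name] = sorted(src_corpora & tgt_corpora)
    return found
```

For the two files listed above, `implied_directions("FTData")` should report a single direction, eng_Latn-rus_Cyrl, with the corpus name mycorpus. If the generated config names anything else, the mismatch comes from the config-generation step rather than from the data layout.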
Then I run:
python stopes/stopes/pipelines/filtering/scripts/populate_data_conf.py --bt-root bt --mined-data-root mined --primary-train-paths FTData --data-conf-dir ConfOutput train_primary,
where bt and mined are empty directories (since I initially have only my own texts, without any preprocessing),
then:
python stopes/stopes/pipelines/filtering/scripts/compute_length_factors.py --data-conf-dir ConfOutput --flores-path flores,
where flores is also an empty directory (I don't need any external corpora, since my goal is to fine-tune only on my data; but --flores-path is a required parameter of compute_length_factors.py, so I assumed I could point it at an arbitrary directory),
and lastly:
python stopes/stopes/pipelines/filtering/filter.py output_dir=FTFiltered data_conf_dir=ConfOutput.
My FTFiltered/config.yaml file looks as follows:

data_conf_dir: /home/molokanov/myapp3/ConfOutput
directions:
- eng_Latn-lij_Latn
- eng_Latn-scn_Latn
executor:
  cluster: local
  log_folder: executor_logs
  slurm_partition: null
output_dir: /home/molokanov/myapp3/FTFiltered
train_bt: null
train_mined: null
train_primary:
  dedup_filter:
    _target_: stopes.pipelines.filtering.filters.DedupFilter
    dedup_pairs: true
    max_source_dedup: null
    max_target_dedup: null
  excluded_corpora: null
  included_corpora:
  - nllbseed
  - tatoeba
  laser_filter: null
  length_filter:
    _target_: stopes.pipelines.filtering.filters.LengthFilter
    max_len: 1050
    max_len_ratio: 9.0
    min_len: 5
    min_src_unique_ratio: null
  lid_filter: null
  normalize_punctuation: true
  normalize_unicode: false
  toxicity_filter: null

As you can see, eng_Latn-lij_Latn and eng_Latn-scn_Latn are not in my dataset, yet they appear in the config. At the same time, eng_Latn-rus_Cyrl, the pair I actually need, is missing from it.
I also don't understand why nllbseed and tatoeba are listed as included corpora in my config.yaml.

Answer 1 · 2023-08-21T08:44:10.000Z

Your config is wrong: remove those 2 directions and add eng_Latn-rus_Cyrl (or, more likely I think, only eng-rus will work). Do the same for the opposite direction if you need it.
You need to add your corpora as well.
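The change described here can be sketched in Python as an operation on the parsed config. This is an illustration only: `patch_filter_config` is a hypothetical helper, the corpus name `mycorpus` is an assumption taken from the file names in the question, and the exact direction string the pipeline accepts (eng_Latn-rus_Cyrl or eng-rus) is not confirmed. Reading and writing the YAML file itself is left out.

```python
import copy


def patch_filter_config(config: dict, directions: list[str], corpora: list[str]) -> dict:
    """Return a copy of config with the directions and primary corpora replaced."""
    patched = copy.deepcopy(config)
    patched["directions"] = list(directions)
    primary = patched.get("train_primary")
    if primary is not None:
        # Replace the default corpora (nllbseed, tatoeba) with your own.
        primary["included_corpora"] = list(corpora)
    return patched


# Trimmed-down stand-in for the generated config shown in the question.
generated = {
    "directions": ["eng_Latn-lij_Latn", "eng_Latn-scn_Latn"],
    "train_primary": {"included_corpora": ["nllbseed", "tatoeba"]},
}

# "mycorpus" is assumed from mycorpus.eng_Latn.gz / mycorpus.rus_Cyrl.gz.
fixed = patch_filter_config(generated, ["eng_Latn-rus_Cyrl"], ["mycorpus"])
```

The helper returns a patched copy rather than mutating its input, so the generated defaults stay available for comparison.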
Download the flores dataset and use it to prepare those 2 configs; I end up with something like the below: