facebookresearch/stopes

SPM-200 dictionary duplicate error

edchengg opened this issue · 4 comments

Hi,

I tried to run prepare_data.py with the config below but hit a duplication error during dictionary loading. I noticed the generated dictionary is roughly double the size of the pretrained SPM-200 vocabulary.

The original dictionary.txt I downloaded from https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/data has a vocab size of 255997.

But the generated output_data/dictionary.source.dict.txt in my output directory has 511987 entries.
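For anyone hitting the same thing, here is a quick way to confirm the duplication (a minimal sketch; the paths are from my run, adjust as needed):

```python
from collections import Counter

def dict_tokens(path):
    # One "<token> <count>" pair per line in a fairseq-style dict file.
    with open(path, encoding="utf-8") as f:
        return [line.split(" ")[0] for line in f if line.strip()]

orig = dict_tokens("stopes/stopes/pipelines/prepare_data/dictionary.txt")
built = dict_tokens("output_data/dictionary.source.dict.txt")

print(len(orig), len(built))  # 255997 vs 511987 in my run
dupes = [tok for tok, n in Counter(built).items() if n > 1]
print(len(dupes), dupes[:5])  # fairseq raises on the first duplicate it sees
```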

What is the right config for using the pretrained SPM-200 model to encode data?
Thanks! @kauterry @Mortimerp9

===Update===
I replaced the vocab files (dictionary.source.dict.txt, dictionary.target.dict.txt) in my output dir with the original dictionary.txt and commented out the relevant lines in prepare_vocab.py (see the sketch below).
It now seems to work and I was able to get output files, but I would still like to know the right way to use NLLB. Thanks!
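Concretely, the file-replacement half of the workaround looked roughly like this (a sketch; it does not show the lines I commented out in prepare_vocab.py):

```python
import shutil

# Overwrite the generated dicts with the pretrained SPM-200 dictionary so that
# fairseq-preprocess loads a single, non-duplicated vocab. Paths are from my run.
pretrained = "stopes/stopes/pipelines/prepare_data/dictionary.txt"
for name in ("dictionary.source.dict.txt", "dictionary.target.dict.txt"):
    shutil.copyfile(pretrained, f"output_data/{name}")
```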

```yaml
train_corpora:
  eng_Latn-zho_Hans:
    values:
      nllb_ner_mark_corpus:
        is_gzip: false
        source: nllb_ner_mark_corpus/eng_Latn-zho_Hans/nllb_ner_mark_corpus.eng_Latn
        target: nllb_ner_mark_corpus/eng_Latn-zho_Hans/nllb_ner_mark_corpus.zho_Hans
  eng_Latn-jpn_Jpan:
    values:
      nllb_ner_mark_corpus:
        is_gzip: false
        source: nllb_ner_mark_corpus/eng_Latn-jpn_Jpan/nllb_ner_mark_corpus.eng_Latn
        target: nllb_ner_mark_corpus/eng_Latn-jpn_Jpan/nllb_ner_mark_corpus.jpn_Jpan

train_mining_corpora: null
train_smt_bt_corpora: null
train_mmt_bt_corpora: null
valid_corpora: null
test_corpora: null

source_vocab_config:
  pretrained:
    model_file: stopes/stopes/pipelines/prepare_data/flores200_sacrebleu_tokenizer_spm.model
    vocab_file: stopes/stopes/pipelines/prepare_data/dictionary.txt
  vocab_build_params:
    vocab_size: 255997
    use_joined_data: false
    model_type: bpe

target_vocab_config:
  pretrained:
    model_file: stopes/stopes/pipelines/prepare_data/flores200_sacrebleu_tokenizer_spm.model
    vocab_file: stopes/stopes/pipelines/prepare_data/dictionary.txt
  vocab_build_params:
    vocab_size: 255997
    use_joined_data: false
    model_type: bpe
    
binarization_config:
  binarize_workers: 6
  max_examples_per_shard: 5000000
  random_seed: 0
  smallest_shard: 1

preprocessing_config:
  moses_config:
    script_directory: stopes/stopes/pipelines/prepare_data/mose_script
    lowercase: false
    normalize_punctuation: true
    remove_non_printing_chars: false
    deescape_special_chars: false

executor_config:
  cluster: local
  log_folder: /tmp
```
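Before launching the pipeline with this config, a small pre-flight check of the pretrained paths can save a run (a sketch; no fairseq needed, just the paths from the config above):

```python
import os

model_file = "stopes/stopes/pipelines/prepare_data/flores200_sacrebleu_tokenizer_spm.model"
vocab_file = "stopes/stopes/pipelines/prepare_data/dictionary.txt"

# Verify both pretrained files referenced in source/target_vocab_config exist.
for path in (model_file, vocab_file):
    print(path, "OK" if os.path.isfile(path) else "MISSING")

# Count dictionary entries; the pretrained SPM-200 dictionary should have 255997.
with open(vocab_file, encoding="utf-8") as f:
    print(sum(1 for line in f if line.strip()), "entries")
```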

**But I got this error:**

`2022-12-20 16:47:14 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='output_data/data_bin/shard000', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='eng_Latn', srcdict='output_data/dictionary.source.dict.txt', suppress_crashes=False, target_lang='zho_Hans', task='translation', tensorboard_logdir=None, testpref=None, tgtdict='output_data/dictionary.target.dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans', use_plasma_view=False, user_dir=None, validpref=None, wandb_project=None, workers=6)

Traceback (most recent call last):
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/bin/fairseq-preprocess", line 8, in <module>
    sys.exit(cli_main())
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 389, in cli_main
    main(args)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 335, in main
    src_dict = task.load_dictionary(args.srcdict)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 94, in load_dictionary
    return Dictionary.load(filename)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 226, in load
    d.add_from_file(f)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 237, in add_from_file
    self.add_from_file(fd)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 261, in add_from_file
    raise RuntimeError(
RuntimeError: Duplicate word found when loading Dictionary: ''. Duplicate words can overwrite earlier ones by adding the #fairseq:overwrite flag at the end of the corresponding row in the dictionary file. If using the Camembert model, please download an updated copy of the model file.

`

Also, I don't understand why this function would skip the first 3 lines of the pretrained dictionary.txt, since the first 3 lines of the SPM-200 dictionary seem to be valid tokens:

`for line in vocab_f.readlines()[3:]:`
https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/prepare_data/prepare_vocab.py#L80
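For what it's worth, it is easy to dump those first lines to see exactly what the `[3:]` slice drops (a sketch). For a raw SentencePiece .vocab file the first three entries are usually the specials <unk>, <s>, </s>, which fairseq adds on its own, but dictionary.txt here is already a fairseq-format dict and its first lines look like ordinary tokens to me:

```python
# Print the first few lines of the pretrained dictionary to see what
# readlines()[3:] skips. In my copy these appear to be ordinary tokens,
# not SentencePiece specials like <unk>, <s>, </s>.
with open("stopes/stopes/pipelines/prepare_data/dictionary.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i == 5:
            break
        print(i, repr(line.rstrip("\n")))
```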

This is the preprocess.log I got. It shows "[eng_Latn] Dictionary: 256001 types"; can someone confirm this is the right size for SPM-200? Thanks!

`Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='output_data/data_bin/shard000', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='eng_Latn', srcdict='output_data/dictionary.source.dict.txt', suppress_crashes=False, target_lang='jpn_Jpan', task='translation', tensorboard_logdir=None, testpref=None, tgtdict='output_data/dictionary.target.dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-jpn_Jpan', use_plasma_view=False, user_dir=None, validpref=None, wandb_project=None, workers=6)

[eng_Latn] Dictionary: 256001 types

[eng_Latn] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-jpn_Jpan.eng_Latn: 29293 sents, 674224 tokens, 0.00148% replaced (by <unk>)

[jpn_Jpan] Dictionary: 256001 types

[jpn_Jpan] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-jpn_Jpan.jpn_Jpan: 29293 sents, 690135 tokens, 0.0556% replaced (by <unk>)

Wrote preprocessed data to output_data/data_bin/shard000
Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='output_data/data_bin/shard000', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='eng_Latn', srcdict='output_data/dictionary.source.dict.txt', suppress_crashes=False, target_lang='zho_Hans', task='translation', tensorboard_logdir=None, testpref=None, tgtdict='output_data/dictionary.target.dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans', use_plasma_view=False, user_dir=None, validpref=None, wandb_project=None, workers=6)

[eng_Latn] Dictionary: 256001 types

[eng_Latn] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans.eng_Latn: 37759 sents, 950207 tokens, 0.00021% replaced (by <unk>)

[zho_Hans] Dictionary: 256001 types

[zho_Hans] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans.zho_Hans: 37759 sents, 959854 tokens, 0.493% replaced (by <unk>)

Wrote preprocessed data to output_data/data_bin/shard000
`
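As a sanity check on the 256001 figure (a sketch, assuming fairseq is installed): fairseq's Dictionary pre-populates four special symbols (<s>, <pad>, </s>, <unk>) before reading the file, so 255997 entries + 4 specials = 256001 types, which matches the log above.

```python
from fairseq.data import Dictionary

# Load the pretrained dict the same way fairseq-preprocess does and check its size.
d = Dictionary.load("stopes/stopes/pipelines/prepare_data/dictionary.txt")
print(len(d))                   # 256001 = 255997 file entries + 4 specials
print(d[0], d[1], d[2], d[3])   # <s> <pad> </s> <unk>
```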

Hi @edchengg, I want to fine-tune NLLB-200 on new data.

I have the same error you had at first (the doubled dictionary size). I did what you described, but it doesn't work for me. I also have another problem: the output data_bin/shard000/ directory only contains preprocess.ary_Arab-eng_Latn.log, which is empty, and a preprocess.log file.


My config.yaml file is attached as a screenshot.

I would like to know whether you have succeeded in fine-tuning NLLB-200, and whether you could help me.
Thank you.

@ibtiRaj I gave up and switched to Hugging Face.