spm-200 dictionary duplicate error
edchengg opened this issue · 4 comments
Hi,
I tried to run prepare_data.py with the config pasted below, but hit a duplicate-word error while loading the dictionary. I noticed that the dictionary size is more than double the pretrained SPM-200 vocab size.
The original dictionary.txt I downloaded from https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/data has a vocab size of 255997, but output_data/dictionary.source.dict.txt (in my output directory) has 511987 entries.
What is the right config for encoding data with the pretrained SPM-200 model?
Thanks! @kauterry @Mortimerp9
===Update===
I replaced the vocabs in my output dir (dictionary.source.dict.txt, dictionary.target.dict.txt) with the original dictionary.txt and commented out the corresponding lines in prepare_vocab.py.
It seems to work now and I was able to get output files, but I would still like to know the right way to use NLLB. Thanks!
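Roughly what I did (a quick sketch of my workaround, not part of stopes; paths are the ones from my setup):

```python
# Rough sketch of the workaround described above (my own script, not part of
# stopes): overwrite the generated source/target dictionaries with the original
# SPM-200 dictionary.txt so fairseq-preprocess only sees one copy of the vocab.
import shutil

ORIG_DICT = "stopes/stopes/pipelines/prepare_data/dictionary.txt"  # path from my config
for name in ("dictionary.source.dict.txt", "dictionary.target.dict.txt"):
    shutil.copyfile(ORIG_DICT, f"output_data/{name}")
```

For reference, the full config I ran prepare_data.py with: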
```yaml
eng_Latn-zho_Hans:
  values:
    nllb_ner_mark_corpus:
      is_gzip: false
      source: nllb_ner_mark_corpus/eng_Latn-zho_Hans/nllb_ner_mark_corpus.eng_Latn
      target: nllb_ner_mark_corpus/eng_Latn-zho_Hans/nllb_ner_mark_corpus.zho_Hans
eng_Latn-jpn_Jpan:
  values:
    nllb_ner_mark_corpus:
      is_gzip: false
      source: nllb_ner_mark_corpus/eng_Latn-jpn_Jpan/nllb_ner_mark_corpus.eng_Latn
      target: nllb_ner_mark_corpus/eng_Latn-jpn_Jpan/nllb_ner_mark_corpus.jpn_Jpan
train_mining_corpora: null
train_smt_bt_corpora: null
train_mmt_bt_corpora: null
valid_corpora: null
test_corpora: null
source_vocab_config:
  pretrained:
    model_file: stopes/stopes/pipelines/prepare_data/flores200_sacrebleu_tokenizer_spm.model
    vocab_file: stopes/stopes/pipelines/prepare_data/dictionary.txt
  vocab_build_params:
    vocab_size: 255997
    use_joined_data: false
    model_type: bpe
target_vocab_config:
  pretrained:
    model_file: stopes/stopes/pipelines/prepare_data/flores200_sacrebleu_tokenizer_spm.model
    vocab_file: stopes/stopes/pipelines/prepare_data/dictionary.txt
  vocab_build_params:
    vocab_size: 255997
    use_joined_data: false
    model_type: bpe
binarization_config:
  binarize_workers: 6
  max_examples_per_shard: 5000000
  random_seed: 0
  smallest_shard: 1
preprocessing_config:
  moses_config:
    script_directory: stopes/stopes/pipelines/prepare_data/mose_script
    lowercase: false
    normalize_punctuation: true
    remove_non_printing_chars: false
    deescape_special_chars: false
executor_config:
  cluster: local
  log_folder: /tmp
```
**But I got the following error:**
```
2022-12-20 16:47:14 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='output_data/data_bin/shard000', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='eng_Latn', srcdict='output_data/dictionary.source.dict.txt', suppress_crashes=False, target_lang='zho_Hans', task='translation', tensorboard_logdir=None, testpref=None, tgtdict='output_data/dictionary.target.dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans', use_plasma_view=False, user_dir=None, validpref=None, wandb_project=None, workers=6)
Traceback (most recent call last):
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/bin/fairseq-preprocess", line 8, in <module>
    sys.exit(cli_main())
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 389, in cli_main
    main(args)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 335, in main
    src_dict = task.load_dictionary(args.srcdict)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 94, in load_dictionary
    return Dictionary.load(filename)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 226, in load
    d.add_from_file(f)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 237, in add_from_file
    self.add_from_file(fd)
  File "/srv/scratch/ychen3411/anaconda3/envs/nllb/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 261, in add_from_file
    raise RuntimeError(
RuntimeError: Duplicate word found when loading Dictionary: ''. Duplicate words can overwrite earlier ones by adding the #fairseq:overwrite flag at the end of the corresponding row in the dictionary file. If using the Camembert model, please download an updated copy of the model file.
```
Also, I don't understand why this function skips the first 3 lines of a pretrained dictionary.txt, since the first 3 lines of the SPM-200 dictionary look like valid tokens:
`for line in vocab_f.readlines()[3:]:`
https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/prepare_data/prepare_vocab.py#L80
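To see where the doubling comes from, this is a quick diagnostic I'd run (my own sketch, not part of stopes; the file names are the ones from my output dir):

```python
# Diagnostic sketch (my own, not part of stopes): compare the original SPM-200
# dictionary with the generated one and list any tokens that appear more than once.
from collections import Counter

def dict_tokens(path):
    # fairseq dictionary format: one "<token> <count>" pair per line
    with open(path, encoding="utf-8") as f:
        return [line.rsplit(" ", 1)[0] for line in f if line.strip()]

orig = dict_tokens("stopes/stopes/pipelines/prepare_data/dictionary.txt")
gen = dict_tokens("output_data/dictionary.source.dict.txt")
print(len(orig), len(gen))  # 255997 vs. 511987 in my run
dups = [tok for tok, n in Counter(gen).items() if n > 1]
print(f"{len(dups)} duplicated tokens, e.g. {dups[:5]}")
```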
This is the preprocess.log I got; it shows `[eng_Latn] Dictionary: 256001 types`. Can someone confirm that this is the right size for SPM-200? Thanks!
```
Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='output_data/data_bin/shard000', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='eng_Latn', srcdict='output_data/dictionary.source.dict.txt', suppress_crashes=False, target_lang='jpn_Jpan', task='translation', tensorboard_logdir=None, testpref=None, tgtdict='output_data/dictionary.target.dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-jpn_Jpan', use_plasma_view=False, user_dir=None, validpref=None, wandb_project=None, workers=6)
[eng_Latn] Dictionary: 256001 types
[eng_Latn] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-jpn_Jpan.eng_Latn: 29293 sents, 674224 tokens, 0.00148% replaced (by <unk>)
[jpn_Jpan] Dictionary: 256001 types
[jpn_Jpan] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-jpn_Jpan.jpn_Jpan: 29293 sents, 690135 tokens, 0.0556% replaced (by <unk>)
Wrote preprocessed data to output_data/data_bin/shard000
Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='output_data/data_bin/shard000', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='eng_Latn', srcdict='output_data/dictionary.source.dict.txt', suppress_crashes=False, target_lang='zho_Hans', task='translation', tensorboard_logdir=None, testpref=None, tgtdict='output_data/dictionary.target.dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans', use_plasma_view=False, user_dir=None, validpref=None, wandb_project=None, workers=6)
[eng_Latn] Dictionary: 256001 types
[eng_Latn] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans.eng_Latn: 37759 sents, 950207 tokens, 0.00021% replaced (by <unk>)
[zho_Hans] Dictionary: 256001 types
[zho_Hans] output_data/tmp/encoded_filtered_train/shard000/spm_length_filtered_train.eng_Latn-zho_Hans.zho_Hans: 37759 sents, 959854 tokens, 0.493% replaced (by <unk>)
Wrote preprocessed data to output_data/data_bin/shard000
```
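For what it's worth, my rough understanding (please correct me if I'm wrong) is that fairseq's Dictionary adds the four special symbols `<s>`, `<pad>`, `</s>`, `<unk>` on top of whatever is in the file, so 255997 + 4 = 256001 would line up with the log. A quick check along those lines:

```python
# Sanity check (my own reasoning, not an authoritative answer): fairseq's
# Dictionary prepends <s>, <pad>, </s>, <unk>, so the reported type count
# should be the number of entries in dictionary.txt plus 4.
from fairseq.data import Dictionary

d = Dictionary.load("stopes/stopes/pipelines/prepare_data/dictionary.txt")
print(len(d))  # expect 255997 + 4 = 256001 if the above is right
```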
Hi @edchengg, I want to fine-tune NLLB-200 on new data.
I ran into the same error you had at first (the doubled dictionary size). I did what you described, but it doesn't work for me. I also have another problem: the output data_bin/shard000/ contains only preprocess.ary_Arab-eng_Latn.log, which is empty, and the preprocess.log file.
I'd like to know whether you succeeded in fine-tuning NLLB-200 and whether you could help me.
Thank you.