tokenizer ignored when creating align.priors
tomsbergmanis opened this issue · 2 comments
I use the configuration below to train word alignments:
- type: train_alignment
  parameters:
    src_data: filtered.fi.gz
    tgt_data: filtered.en.gz
    parameters:
      src_tokenizer: [moses, fi]
      tgt_tokenizer: [moses, en]
      model: 3
    output: align.priors
but when I check align.priors I see that the data was not tokenized:
LEX ! "Hm! 1
LEX ! "Misasja!" 2
LEX ! "New 1
LEX ! "Tõepoolest!" 1
LEX ! "ei! 1
LEX ! "kuninganna! 1
LEX ! "tõepoolest!" 1
This seems like a bug, right?
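For reference, after Moses tokenization the punctuation in those entries should come off as separate tokens. A rough stand-in for that splitting (a regex approximation for illustration only, not the real moses tokenizer) shows what the source tokens above would look like if tokenization had been applied:

```python
import re

def rough_moses_split(text):
    # Approximation only: separate runs of word characters from
    # individual punctuation marks, as the Moses tokenizer would
    # do for simple cases like the LEX entries above.
    return re.findall(r"\w+|[^\w\s]", text)

# Untokenized tokens as they appear in the align.priors above:
for tok in ['"Hm!', '"Misasja!"', '"Tõepoolest!"']:
    print(tok, "->", rough_moses_split(tok))
# '"Hm!' -> ['"', 'Hm', '!']
```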
I cannot replicate the problem. For example, with the following configuration, the produced align.priors looks tokenized, at least to the extent that the moses tokenizer can manage. In any case, the difference is very clear if you leave out the *_tokenizer options. Can you test this?
If it works, then I'd blame the tokenizer library and/or something weird in your data. (Your en side looks like Estonian, but I think the en settings for the tokenizer should mostly work.) If it does not, I need details on which software versions you are using.
common:
  output_directory: work

steps:

- type: opus_read
  parameters:
    corpus_name: QED
    source_language: fi
    target_language: en
    release: latest
    preprocessing: raw
    src_output: fi.raw.gz
    tgt_output: en.raw.gz

- type: filter
  parameters:
    inputs: [fi.raw.gz, en.raw.gz]
    outputs: [fi.train.gz, en.train.gz]
    filters:
      - LengthFilter:
          unit: char
          min_length: 10
          max_length: 500
      - LengthRatioFilter:
          unit: char
          threshold: 3

- type: train_alignment
  parameters:
    src_data: fi.train.gz
    tgt_data: en.train.gz
    parameters:
      src_tokenizer: [moses, fi]
      tgt_tokenizer: [moses, en]
      model: 3
    output: align.priors
Thanks for your swift answer! I guess it was my mistake.