facebookresearch/fairseq-lua

Generation with alignment dictionary

patrik-lambert opened this issue · 6 comments

Hi, scripts like generate-lines.lua have an option (-aligndictpath) to supply an alignment dictionary.
How should this dictionary be built?
Thanks!

Hi there! The alignment dictionary is used for training and decoding with vocabulary selection models. You first create an alignment using scripts/build_sym_alignment.py and then generate the dictionary using scripts/makealigndict.lua.
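For reference, the two steps amount to something like the sketch below. The argument placeholders are assumptions, not the scripts' real flags; run each script without arguments (or read its source) to see the actual options.

```sh
# Sketch only -- <...> arguments are placeholders, not real flags.

# 1. Build a symmetrized word alignment over the parallel training data.
python scripts/build_sym_alignment.py <source.txt> <target.txt> <out.align>

# 2. Turn the alignment into the dictionary consumed by -aligndictpath.
th scripts/makealigndict.lua <out.align> <aligndict.out>
```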

I tried this, but I got a "bad argument #2 to '?' (out of bounds)" error. It seems the script expects binary files with .idx and .bin extensions:

if config.aligndictpath ~= '' then
    config.aligndict = tnt.IndexedDatasetReader{
        indexfilename = config.aligndictpath .. '.idx',
        datafilename = config.aligndictpath .. '.bin',
    }
end

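Given that snippet, a quick way to rule out missing files is a sanity check like the one below (hypothetical helper code, not part of fairseq):

```lua
-- Hypothetical sanity check: verify that both files the
-- IndexedDatasetReader expects actually exist for the path
-- given to -aligndictpath.
local function fileExists(path)
    local f = io.open(path, 'r')
    if f then f:close(); return true end
    return false
end

local base = 'path/alignment'  -- value passed to -aligndictpath
for _, ext in ipairs({'.idx', '.bin'}) do
    assert(fileExists(base .. ext), base .. ext .. ' is missing')
end
```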
I also tried with the alignment file produced by preprocess.lua, but I got the same error.

Ah, my bad! It should indeed be the alignment file produced by fairseq preprocess. Does that not work?

No, it does not. If I run generate-lines.lua without the -aligndictpath option, it works correctly. If I add the option ("-aligndictpath path/alignment", where the alignment files corresponding to the source and target dictionaries are path/alignment.idx and path/alignment.bin), I get the following error:

/HD_4TB_1/Tools/torch/install/bin/luajit: bad argument #2 to '?' (out of bounds at /HD_4TB_1/Tools/torch/pkg/torch/lib/TH/generic/THStorage.c:202)
stack traceback:
[C]: at 0x7fbd7e883b70
[C]: in function '__index'
...nstall/share/lua/5.1/torchnet/dataset/indexeddataset.lua:412: in function '__init'
/HD_4TB_1/Tools/torch/install/share/lua/5.1/torch/init.lua:91: in function </HD_4TB_1/Tools/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'IndexedDatasetReader'
/home/patrik/soft/fairseq/generate-lines.lua:66: in main chunk
[C]: in function 'dofile'
...ools/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

Hi. I went back to this problem and found that it is caused by this bug:
torch/torch7#1064

After updating Torch to the version indicated in that issue, I could train fairseq with the alignment dictionary produced by fairseq preprocess, and then decode with it.

Thanks.

Thanks for letting us know and sorry for not getting back earlier!