facebookresearch/fairseq-lua

Generation with alignment dictionary

patrik-lambert opened this issue · 6 comments

Hi, scripts like generate-lines.lua have an option (-aligndictpath) to supply an alignment dictionary.
How should this dictionary be built?
Thanks!

Hi there! The alignment dictionary is used for training and decoding with vocabulary selection models. You first create an alignment using scripts/build_sym_alignment.py and then generate the dictionary using scripts/makealigndict.lua.
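For reference, the two steps amount to something like the sketch below. The argument placeholders are assumptions, not the scripts' real flags; run each script without arguments (or read its source) to see the actual options.

```sh
# Sketch only -- <...> arguments are placeholders, not real flags.

# 1. Build a symmetrized word alignment over the parallel training data.
python scripts/build_sym_alignment.py <source.txt> <target.txt> <out.align>

# 2. Turn the alignment into the dictionary consumed by -aligndictpath.
th scripts/makealigndict.lua <out.align> <aligndict.out>
```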

I tried this, but I got a "bad argument #2 to '?' (out of bounds)" error. It seems the script expects binary files with .idx and .bin extensions:

if config.aligndictpath ~= '' then
    config.aligndict = tnt.IndexedDatasetReader{
        indexfilename = config.aligndictpath .. '.idx',
        datafilename = config.aligndictpath .. '.bin',
    }
end

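Given that snippet, a quick way to rule out missing files is a sanity check like the one below (hypothetical helper code, not part of fairseq):

```lua
-- Hypothetical sanity check: verify that both files the
-- IndexedDatasetReader expects actually exist for the path
-- given to -aligndictpath.
local function fileExists(path)
    local f = io.open(path, 'r')
    if f then f:close(); return true end
    return false
end

local base = 'path/alignment'  -- value passed to -aligndictpath
for _, ext in ipairs({'.idx', '.bin'}) do
    assert(fileExists(base .. ext), base .. ext .. ' is missing')
end
```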
I also tried with the alignment file produced by preprocess.lua, but I got the same error.

Ah, my bad! It should indeed be the alignment file produced by fairseq preprocess. Does that not work?

No, it does not. If I run generate-lines.lua without the -aligndictpath option, it works correctly. If I add the option ("-aligndictpath path/alignment", where the alignment files corresponding to the source and target dictionaries are path/alignment.idx and path/alignment.bin), I get the following error:

/HD_4TB_1/Tools/torch/install/bin/luajit: bad argument #2 to '?' (out of bounds at /HD_4TB_1/Tools/torch/pkg/torch/lib/TH/generic/THStorage.c:202)
stack traceback:
[C]: at 0x7fbd7e883b70
[C]: in function '__index'
...nstall/share/lua/5.1/torchnet/dataset/indexeddataset.lua:412: in function '__init'
/HD_4TB_1/Tools/torch/install/share/lua/5.1/torch/init.lua:91: in function </HD_4TB_1/Tools/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'IndexedDatasetReader'
/home/patrik/soft/fairseq/generate-lines.lua:66: in main chunk
[C]: in function 'dofile'
...ools/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

Hi. I went back to this problem and found that it is caused by this bug:
torch/torch7#1064

After updating Torch to the version indicated in that issue, I could train fairseq with the alignment dictionary produced by fairseq preprocess, and then decode with it.

Thanks.

Thanks for letting us know and sorry for not getting back earlier!