Alignments for BPE token
Closed this issue · 2 comments
moore3930 commented
Hi, I was just wondering whether simalign supports extracting alignments at the BPE level?
masoudjs commented
Hi,
I think the easiest way would be to give BPE-segmented text as input (instead of word-segmented text).
Then the model treats the BPEs as words.
The other way is to edit the code:
Line 232 in 05332bf
In this line, we convert the aligned BPE index pairs (i, j) into word indexes for the source and target tokens.
To stay at the BPE level, you can just keep the (i, j) pairs and skip the conversion.
You can find the mappings in l1_b2w_map and l2_b2w_map.
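To make the suggestion concrete, here is a small sketch (an assumption about the logic, not simalign's actual code) of what such a BPE-to-word map looks like and how the conversion at the cited line would use it; the names `l1_b2w_map` and `l2_b2w_map` are taken from the thread, everything else is illustrative:

```python
# Sketch (assumption, not simalign's actual implementation):
# build a map from each BPE position to the word it came from,
# then show how BPE-level alignment pairs collapse to word-level pairs.

def build_b2w_map(word_bpe_lists):
    """Map each BPE position to the index of the word it belongs to."""
    b2w = []
    for word_idx, bpes in enumerate(word_bpe_lists):
        b2w.extend([word_idx] * len(bpes))
    return b2w

# Hypothetical segmentations of a two-word source and target sentence.
l1_b2w_map = build_b2w_map([["Ne", "w"], ["York"]])   # -> [0, 0, 1]
l2_b2w_map = build_b2w_map([["New"], ["Yo", "rk"]])   # -> [0, 1, 1]

# Hypothetical BPE-level alignment pairs (i, j).
bpe_aligns = [(0, 0), (1, 0), (2, 1), (2, 2)]

# What the cited line does: convert BPE pairs to word pairs.
word_aligns = sorted({(l1_b2w_map[i], l2_b2w_map[j]) for i, j in bpe_aligns})
print(word_aligns)  # [(0, 0), (1, 1)]

# To keep BPE-level alignments instead, simply use bpe_aligns as-is.
```

Keeping the raw pairs just means skipping the set-comprehension step above and returning `bpe_aligns` directly.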
moore3930 commented
Many thanks for your quick reply. One concern is that this requires using the same tokenizer as the underlying pretrained model, e.g., mBERT, for my own task, right?
Anyway, I will try it following your suggestion and let you know whether it works.