Alignments for BPE token
Closed this issue · 2 comments
moore3930 commented
Hi, I was just wondering whether simalign supports extracting alignments at the BPE level?
masoudjs commented
Hi,
I think the easiest way would be to give BPE-segmented text as input (instead of word-segmented text).
Then the model treats the BPEs as words.
The other way is to edit the code:
Line 232 in 05332bf
In this line, we convert the aligned BPE index pairs (i, j) into word indexes for the source and target tokens.
To stay at the BPE level, you can just keep the (i, j) pairs and skip the conversion.
You can find the mappings in l1_b2w_map and l2_b2w_map.
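To make the suggestion concrete, here is a small sketch (an assumption about the logic, not simalign's actual code) of what such a BPE-to-word map looks like and how the conversion at the cited line would use it; the names `l1_b2w_map` and `l2_b2w_map` are taken from the thread, everything else is illustrative:

```python
# Sketch (assumption, not simalign's actual implementation):
# build a map from each BPE position to the word it came from,
# then show how BPE-level alignment pairs collapse to word-level pairs.

def build_b2w_map(word_bpe_lists):
    """Map each BPE position to the index of the word it belongs to."""
    b2w = []
    for word_idx, bpes in enumerate(word_bpe_lists):
        b2w.extend([word_idx] * len(bpes))
    return b2w

# Hypothetical segmentations of a two-word source and target sentence.
l1_b2w_map = build_b2w_map([["Ne", "w"], ["York"]])   # -> [0, 0, 1]
l2_b2w_map = build_b2w_map([["New"], ["Yo", "rk"]])   # -> [0, 1, 1]

# Hypothetical BPE-level alignment pairs (i, j).
bpe_aligns = [(0, 0), (1, 0), (2, 1), (2, 2)]

# What the cited line does: convert BPE pairs to word pairs.
word_aligns = sorted({(l1_b2w_map[i], l2_b2w_map[j]) for i, j in bpe_aligns})
print(word_aligns)  # [(0, 0), (1, 1)]

# To keep BPE-level alignments instead, simply use bpe_aligns as-is.
```

Keeping the raw pairs just means skipping the set-comprehension step above and returning `bpe_aligns` directly.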
moore3930 commented
Many thanks for your quick reply. One concern is that this requires using the same tokenizer as the underlying pretrained model, e.g., mBERT, for my own task, right?
Anyway, I will try it following your suggestion and let you know whether it works.