cisnlp/simalign

How to get matchings from alignment

Closed this issue · 2 comments

I have the following example:
Sentence A: a # 9.8 m deficit recorded for 2014/15 at an essex hospital is to be investigated by a health service watchdog.
Sentence B: A £9.8m deficit recorded for 2014/15 at an Essex hospital is to be investigated by a health service watchdog.

When I run the following:
import simalign
myaligner = simalign.SentenceAligner(token_type="word")
aligns = myaligner.get_word_aligns(sentence_A, sentence_B)['itermax']

This produces an aligns list of the form:
[(0, 0), (2, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]

I cannot figure out how you then produce a matching of the form:
[(0, 0), (1, 1), (2, 1), (3, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]

The interactive website does this in order to produce the graphs, but I cannot find where the provided code does anything of this form.

Thanks in advance!

@VanderpoelLiam The alignments on the website are computed at the subword level. Do you also get different results when you set token_type="bpe"?
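
To illustrate why a subword-level alignment can yield the extra pairs (1, 1) and (3, 1), here is a minimal sketch of projecting subword alignments back onto word indices. It is not simalign's own code: the helper project_to_words, the subword splits, and the subword alignment pairs are all assumptions made up for this example, and the real tokenizer may split these words differently.

```python
# Illustrative sketch only; NOT simalign's implementation.
# The helper name, subword splits, and alignment pairs are hypothetical.

def project_to_words(subword_aligns, src_sub2word, trg_sub2word):
    """Map subword-level alignment pairs to unique word-level pairs."""
    word_pairs = {
        (src_sub2word[i], trg_sub2word[j]) for i, j in subword_aligns
    }
    return sorted(word_pairs)

# Assumed subword split of the first words of Sentence A.
# Words: a | # | 9.8 | m | deficit
src_subwords = ["a", "#", "9", ".", "8", "m", "deficit"]
src_sub2word = [0, 1, 2, 2, 2, 3, 4]  # word index of each subword

# Assumed subword split of the first words of Sentence B.
# Words: A | £9.8m | deficit
trg_subwords = ["A", "£", "9", ".", "8", "##m", "deficit"]
trg_sub2word = [0, 1, 1, 1, 1, 1, 2]  # word index of each subword

# Assumed subword alignment: each source piece aligns to the
# target piece at the same position.
subword_aligns = [(k, k) for k in range(len(src_subwords))]

word_aligns = project_to_words(subword_aligns, src_sub2word, trg_sub2word)
print(word_aligns)
# [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]
```

Under these assumptions, the pieces of "#", "9.8" and "m" each align to some piece of "£9.8m", so all three source words end up paired with target word 1. A word-level alignment has only one vector per word to compare, which is consistent with "#" and "m" being left unaligned in the output above.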

Thanks! Adding token_type="bpe" fixes my issue.