Umlauts need additional edit distance
then4p commented
I built a test that looks like this:
    def test_umlauts(self):
        dictionary_path = os.path.join(self.fortests_path, "umlaut_dict.txt")
        edit_distance_max = 1
        prefix_length = 5
        sym_spell = SymSpell(edit_distance_max, prefix_length)
        sym_spell.load_dictionary(dictionary_path, 0, 1)
        result = sym_spell.lookup("dämen", Verbosity.TOP, 2)
        self.assertEqual(1, len(result))
        self.assertEqual("damen", result[0].term)
The dictionary (umlaut_dict.txt) contains only this line: damen 1
However, this test fails with edit_distance_max = 1 and passes with edit_distance_max = 2, even though only one character differs between dämen and damen.
It looks like a bug where umlauts such as 'ä' are interpreted as 'ae' or something similar?
If anyone has an idea where to look, I'd gladly try to fix it, but I haven't found anything yet.
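To narrow down where the extra edit could come from, here is a small standalone sketch that does not use the library at all. The `levenshtein` helper is a hypothetical illustration written for this post, not SymSpell's own distance function. It compares the two words as Unicode code points, as UTF-8 bytes, and in decomposed (NFD) form. Whether any of these matches what actually happens inside `load_dictionary` or `lookup` is an assumption that still needs checking.

```python
import unicodedata


def levenshtein(a, b):
    """Plain Levenshtein distance over any two sequences (str or bytes)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


word = "d\u00e4men"   # "dämen" with the precomposed character ä
target = "damen"

# Compared as code points: ä -> a is a single substitution.
dist_chars = levenshtein(word, target)

# Compared as UTF-8 bytes: ä is two bytes (0xC3 0xA4), so two edits are needed.
dist_bytes = levenshtein(word.encode("utf-8"), target.encode("utf-8"))

# Compared in NFD form: ä becomes a + U+0308, so one deletion is enough.
dist_nfd = levenshtein(unicodedata.normalize("NFD", word), target)

print(dist_chars, dist_bytes, dist_nfd)  # prints: 1 2 1
```

Only the byte-level comparison needs two edits. Reading 'ä' as 'ae' would give "daemen", which is still one deletion away from "damen", so that alone would not explain the behaviour. If the input or the dictionary file is handled as raw bytes somewhere (for example through the file encoding), that might be a place to look.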