viig99/SymSpellCppPy

Umlauts need additional edit distance


I built a test that looks like this:

def test_umlauts(self):
  dictionary_path = os.path.join(self.fortests_path, "umlaut_dict.txt")
  
  edit_distance_max = 1
  prefix_length = 5
  sym_spell = SymSpell(edit_distance_max, prefix_length)
  sym_spell.load_dictionary(dictionary_path, 0, 1)
  
  result = sym_spell.lookup("dämen", Verbosity.TOP, 2)
  self.assertEqual(1, len(result))
  self.assertEqual("damen", result[0].term)

The dictionary file umlaut_dict.txt contains only this line: damen 1

However, this test fails with edit_distance_max = 1 and passes with edit_distance_max = 2, even though only one character changes from dämen to damen.

Is there a bug that causes umlauts like 'ä' to be counted as two characters, for example by being interpreted as 'ae' or something similar?
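One possible explanation, which is only a guess and has not been checked against the SymSpellCppPy source, is that the distance is computed over UTF-8 bytes rather than characters. 'ä' (U+00E4) is a single code point but two bytes in UTF-8, so a byte-wise comparison would report a distance of 2. The sketch below illustrates the difference in plain Python; the `levenshtein` helper is written here for illustration and is not part of the library.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance over any two indexable sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


word = "d\u00e4men"  # "dämen" with a precomposed 'ä' (U+00E4)
target = "damen"

# Comparing Python str objects compares code points.
char_distance = levenshtein(word, target)

# Comparing the UTF-8 encodings compares raw bytes; 'ä' becomes 0xC3 0xA4.
byte_distance = levenshtein(word.encode("utf-8"), target.encode("utf-8"))

print(char_distance, byte_distance)
```

If this is what is happening, the character-level distance matches the expectation in the test while the byte-level distance matches the observed behaviour.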

If anyone has an idea where to look I'd gladly try to fix it but I haven't found anything yet.