pytries/marisa-trie

Repeated keys

hulikau opened this issue · 2 comments

import marisa_trie
a = [(u'1', '1'), (u'1', '2')]
tr = marisa_trie.BytesTrie(a)
print tr.keys()

This will output [u'1', u'1'].

I guess, that, this function should returns a [u'1']

I'm ready to fix it, if someone consider, that this is bug.

Woah, looks like a bug to me. Interestingly, it is not present in marisa_trie.Trie:

>>> marisa_trie.Trie(["foo", "foo"]).keys()
['foo']
>>> marisa_trie.BytesTrie([("foo", b"1"), ("foo", b"2")]).keys()
['foo', 'foo']

@superbobry
In the same time in tests this situation is expectable: https://github.com/pytries/marisa-trie/blob/0.7.5/tests/test_bytes_trie.py#L106.

By the way I understood why this happens:
For Trie class everything is perfect. For BytesTrie and others inherited from him, in the constructor pairs (key, value) transform to new_key = key + separator + value. Inside the methods keys and iterkeys you iterate over the set of new_keys and then cut old key back.

for i in range(0, ag.key().length()):
  if raw_key[i] == self._c_value_separator:
    key = raw_key[:i].decode('utf8')

And then you add the same key to the res several times.