WojciechMula/pyahocorasick

Weird failure on unicode/windows or bytes/linux build

pombredanne opened this issue · 3 comments

The Windows tests with a unicode build and the Linux tests with a non-unicode (bytes) build both fail the same test, `TestTrieIterators.test_items`:

On bytes/linux:

_____________________________________________________ TestTrieIterators.test_items _____________________________________________________

self = <test_unit.TestTrieIterators testMethod=test_items>

    def test_items(self):
        A = self.A
        I = []
        for i, w in enumerate(self.words):
            A.add_word(conv(w), i + 1)
            I.append((conv(w), i + 1))
    
        L = [x for x in A.items()]
        self.assertEqual(len(L), len(I))
>       self.assertEqual(set(L), set(I))
E       AssertionError: Items in the first set but not the second:
E       (b'a\x00h', 3)
E       (b'p\x00y\x00t\x00', 2)
E       (b'c\x00o\x00r\x00a\x00', 4)
E       (b'w\x00o\x00', 1)
E       Items in the second set but not the first:
E       (b'word', 1)
E       (b'python', 2)
E       (b'aho', 3)
E       (b'corasick', 4)

tests/test_unit.py:431: AssertionError

On windows/unicode:

 ________________________ TestTrieIterators.test_items _________________________
  
  self = <test_unit.TestTrieIterators testMethod=test_items>
  
      def test_items(self):
          A = self.A
          I = []
          for i, w in enumerate(self.words):
              A.add_word(conv(w), i + 1)
              I.append((conv(w), i + 1))
      
          L = [x for x in A.items()]
          self.assertEqual(len(L), len(I))
  >       self.assertEqual(set(L), set(I))
  E       AssertionError: Items in the first set but not the second:
  E       ('w\x00o\x00', 1)
  E       ('p\x00y\x00t\x00', 2)
  E       ('a\x00h', 3)
  E       ('c\x00o\x00r\x00a\x00', 4)
  E       Items in the second set but not the first:
  E       ('corasick', 4)
  E       ('python', 2)
  E       ('aho', 3)
  E       ('word', 1)
  
  D:\a\pyahocorasick\pyahocorasick\tests\test_unit.py:422: AssertionError

I wonder whether this is caused by narrow vs. wide Python unicode builds on Windows?

It looks as if a null were being injected after each letter, and as if the Windows build were compiled with bytes rather than with the unicode define.
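
A minimal sketch of the pattern visible in both tracebacks. It assumes (nothing in this thread confirms it) that the keys end up stored as 2-byte little-endian code units while only `len(word)` bytes are read back. The helper `truncated_utf16` is hypothetical and is not part of pyahocorasick; it only reproduces the observed values in pure Python.

```python
# Words and values used by the failing test, taken from the assertion output.
words = ["word", "python", "aho", "corasick"]

# Keys reported as "in the first set but not the second" on bytes/linux.
observed = [
    b"w\x00o\x00",
    b"p\x00y\x00t\x00",
    b"a\x00h",
    b"c\x00o\x00r\x00a\x00",
]


def truncated_utf16(word: str) -> bytes:
    """Encode as UTF-16-LE (2 bytes per ASCII letter), then keep only
    len(word) bytes, as if a character count were used as a byte count."""
    return word.encode("utf-16-le")[: len(word)]


for word, seen in zip(words, observed):
    # Every observed key matches the truncated 2-byte encoding exactly.
    assert truncated_utf16(word) == seen
```

If this assumption holds, the interleaved nulls and the cut-off keys would be one symptom rather than two: a width mismatch between how the key is written and how it is read back.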

True, it looks as you described. I have no Windows machine to check this on.

No worries! I am looking into this with tests ... and will push some investigation to my WIP branch for 2.0.
The issue also shows up on Linux, FWIW.