vlasovskikh/funcparserlib

make_tokenizer doesn't deal with binary tokenizers

gsnedders opened this issue

tokenize = make_tokenizer([
    (u'x', (br'\xff\n',)),
])

tokens = list(tokenize(b"\xff\n"))

raises

  File "/Users/gsnedders/Documents/other-projects/funcparserlib/funcparserlib/funcparserlib/tests/test_parsing.py", line 76, in test_tokenize_bytes
    tokens = list(tokenize(b"\xff\n"))
  File "/Users/gsnedders/Documents/other-projects/funcparserlib/funcparserlib/funcparserlib/lexer.py", line 107, in f
    t = match_specs(compiled, str, i, (line, pos))
  File "/Users/gsnedders/Documents/other-projects/funcparserlib/funcparserlib/funcparserlib/lexer.py", line 91, in match_specs
    nls = value.count(u'\n')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

match_specs needs to handle both unicode and bytes input: it currently counts newlines with the unicode literal u'\n', which forces an implicit ASCII decode of a bytes match and fails on non-ASCII bytes like 0xff. It should count line feeds using a literal of the same type as the input.
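A minimal sketch of what a type-aware newline count could look like (this is a hypothetical helper, not the actual funcparserlib code; `count_newlines` and its signature are my own names for illustration):

```python
def count_newlines(value):
    # Pick a line-feed literal matching the type of the matched value,
    # so bytes input is never implicitly decoded as ASCII.
    nl = b"\n" if isinstance(value, bytes) else u"\n"
    return value.count(nl)
```

With this approach, `count_newlines(b"\xff\n")` counts the newline without triggering a UnicodeDecodeError, and unicode input behaves as before.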