pytorch/text

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

PetrochukM opened this issue · 13 comments

Tried opening a text/plain; charset=utf-8 file with torchtext.

# file -i data/simple_questions_wikidata/train.tsv
data/simple_questions_wikidata/train.tsv: text/plain; charset=utf-8

Got this stack trace:

Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]

Fixed with:
with open(os.path.expanduser(path), encoding='utf-8') as f:

Here: https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L104
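
For anyone who wants to reproduce this without the dataset, here is a minimal sketch. The sample row and the temporary file are made up for illustration; forcing encoding="ascii" stands in for a machine whose locale makes open() default to ASCII.

```python
import os
import tempfile

# Hypothetical sample row; "é" (U+00E9) encodes to the bytes 0xc3 0xa9 in UTF-8.
sample = "caf\u00e9\tbeverage\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.tsv")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # Forcing ASCII mimics open() on a system whose preferred encoding is ASCII.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Passing the encoding explicitly makes the read independent of the locale.
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f]

print(ascii_failed, lines)
```

The explicit encoding argument is what makes the second read succeed regardless of how the environment is configured.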

I'm confused. You're on Python 3.5, but it looks like the line.decode('utf-8') branch (on line 106) ran, even though it sits behind a six.PY2 condition that should be False. Any idea what's going on there? Maybe insert some print statements?

Yeah, I'd expect the fix you mentioned (the encoding argument to open) to be a Python 2 fix. What are the values of the $LC_ALL environment variable and sys.getdefaultencoding() on your system?

If I had to guess, I'd say maybe you're still running an older version of torchtext (e.g. in a Python session you've had open for a while) but the code in the dist-packages folder has been updated (and the traceback pulls from there rather than what's actually running).

Version of torchtext: (Most recent)

$git log
commit df7b391d3c02471a2095170ee83c9de4586930e7
Author: Nelson Liu <nelson.liu.2009@gmail.com>
Date:   Fri Jul 14 15:48:45 2017 -0700

    Fix lint

commit f411d83ecf63936d7f4062b9bbc1a667a07f2caf
Author: Nelson Liu <nelson.liu.2009@gmail.com>
Date:   Fri Jul 14 15:48:17 2017 -0700

    Add non-regression test

@jekbradbury
Reinstall of torchtext:

Installed /usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg
Processing dependencies for torchtext==0.1.1
Finished processing dependencies for torchtext==0.1.1

@nelson-liu
System:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Got the same error:

Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 56, in splits
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 107, in __init__
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 106, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)

Replicated the error in terminal:

>>> f = open('/root/qa/data/simple_questions_wikidata/train.tsv', 'r')
>>> [line for line in f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)

Yes, the sys.getdefaultencoding() value looks unexpected. Python 3 decodes text files as UTF-8 by default, but only when the locale (LC_CTYPE) is Unicode-aware; otherwise open() falls back to the locale's encoding, which can be ASCII.

I'm betting that echo $LANG and echo $LC_CTYPE will print C or something on your machine -- try setting these environment variables beforehand and let me know how that goes:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
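
A quick way to see which encodings the interpreter has picked up is sketched below. The helper name describe_encodings is made up for illustration. On Python 3, text-mode open() without an encoding argument uses locale.getpreferredencoding(False), so that is the value that matters for this error.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python consults in different situations."""
    return {
        # Used by str.encode() / bytes.decode() when no codec is given.
        "default": sys.getdefaultencoding(),
        # Used by open() in text mode when no encoding argument is passed.
        "preferred": locale.getpreferredencoding(False),
        # Used to encode and decode file names.
        "filesystem": sys.getfilesystemencoding(),
    }


info = describe_encodings()
for name, value in sorted(info.items()):
    print(name, "=", value)
```

If "preferred" does not report a UTF-8 codec, reading a UTF-8 file without an explicit encoding can fail as in the traceback above.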

Running this in Docker on a GPU machine.

Tried echo $LANG:

# echo $LANG
en_US.UTF-8
# echo $LC_CTYPE

Tried exporting:

# export LANG=en_US.UTF-8
# export LC_ALL=en_US.UTF-8
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

Python3 CLI:

>>> f = open('/root/qa/data/simple_questions_wikidata/train.tsv', 'r')
>>> [line.decode('utf-8') for line in f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)

Torchtext:

Traceback (most recent call last):
  File "example/tune.py", line 268, in <module>
    main()
  File "example/tune.py", line 208, in main
    dev_examples, train_examples = load_examples(options)
  File "/root/pytorch-seq2seq/example/lib/utils.py", line 222, in load_examples
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]

The default Ubuntu Docker image doesn't have the en_US.UTF-8 locale installed. That's the warning you're getting when exporting. Try:

RUN apt-get update --fix-missing && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
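
To confirm from inside the container that the locale was actually generated, something like the following sketch may help. The helper locale_available is a hypothetical name; it only relies on the standard locale module and restores the previous setting before returning.

```python
import locale


def locale_available(name):
    """Return True if the C library can switch LC_CTYPE to the named locale."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Raised when the locale is not installed, which is the condition
        # behind bash's "cannot change locale" warning.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the original setting


print("C:", locale_available("C"))
print("en_US.UTF-8:", locale_available("en_US.UTF-8"))
```

If the second line prints False, exporting LANG or LC_ALL will not help until the locale itself has been generated.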

bump; did this end up working / should we close this?

On another note, I've never been sure how to write code that handles Unicode on both py2 and py3. Much of it hinges on the assumption that py3 uses Unicode by default, but this isn't always true. Should we refactor things to check the default encoding instead, or (probably more sane) add something to the README about setting locales properly so that py3 uses Unicode by default?
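
One possible approach, sketched here and not necessarily what torchtext should adopt, is to open files through io.open with an explicit encoding. io.open accepts the encoding argument on both Python 2 and Python 3, so the result does not depend on the locale. The function name read_lines and the demo file are made up for illustration.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Return the file's lines as text (unicode on Python 2, str on Python 3)."""
    # io.open behaves the same on both major versions, unlike the builtin
    # open() on Python 2, which has no encoding argument.
    with io.open(os.path.expanduser(path), encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Demo with a throwaway file containing a non-ASCII character.
fd, demo_path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "wb") as f:
    f.write(u"na\u00efve\tadjective\n".encode("utf-8"))

rows = read_lines(demo_path)
os.remove(demo_path)
print(rows)
```

With this pattern the py2-only line.decode('utf-8') branch is not needed, because the lines are already decoded text on both versions.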

This ended up working!

Sorry to bump this, but I've run into the same problem even though on my machine (Red Hat 6.9) I have the LANG and LC_ALL variables set to en_US.UTF-8. I think part of it might be that I'm trying to use Python 3 to load models that were saved with Python 2.
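
If the Python 2 to Python 3 guess is right, the failure may come from unpickling and not from the locale. The sketch below uses only the standard pickle module. The payload is a hand-written protocol-2 pickle resembling what Python 2 emits for the byte string '\xc3\xa9'. Whether a particular model loader forwards an encoding argument to pickle is an assumption to check against that loader's documentation.

```python
import pickle

# Hand-written protocol-2 pickle of a Python 2 str holding the bytes 0xc3 0xa9.
py2_payload = b"\x80\x02U\x02\xc3\xa9q\x00."

# Python 3 decodes Python 2 str objects as ASCII by default, which fails here.
try:
    pickle.loads(py2_payload)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# encoding="bytes" keeps the raw bytes; a real codec name decodes them to text.
as_bytes = pickle.loads(py2_payload, encoding="bytes")
as_text = pickle.loads(py2_payload, encoding="utf-8")

print(default_failed, as_bytes, as_text)
```

In that situation, setting LANG or LC_ALL would not be expected to change anything, since the ASCII default comes from pickle itself.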

The default Ubuntu Docker image doesn't have the en_US.UTF-8 locale installed. That's the warning you're getting when exporting. Try:

RUN apt-get update --fix-missing && apt-get locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8

@nelson-liu
Minor, but it could be helpful: can you change apt-get locales to apt-get install locales?

apt-get update --fix-missing && apt-get install locales

@nelson-liu Thanks for your solution. I encountered a similar issue and solved it with your suggestion.