mlcommons/training_results_v0.6

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3


Hi All,

Is this a problem with the dataset or with the code? Thanks for any hints.

Run:
training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/download_dataset.sh

Error:
Input sentences: 4562102 Output sentences: 4524868
Cleaning data/train.tok...
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = "C.UTF-8",
    LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
clean-corpus.perl: processing data/train.tok.de & .en to data/train.tok.clean, cutoff 1-80, ratio 9
..........(100000)..........(200000)..........(300000)..........(400000)..........(500000)..........(600000)..........(700000)..........(800000)..........(900000)..........(1000000)..........(1100000)..........(1200000)..........(1300000)..........(1400000)..........(1500000)..........(1600000)..........(1700000)..........(1800000)..........(1900000)..........(2000000)..........(2100000)..........(2200000)..........(2300000)..........(2400000)..........(2500000)..........(2600000)..........(2700000)..........(2800000)..........(2900000)..........(3000000)..........(3100000)..........(3200000)..........(3300000)..........(3400000)..........(3500000)..........(3600000)..........(3700000)..........(3800000)..........(3900000)..........(4000000)..........(4100000)..........(4200000)..........(4300000)..........(4400000)..........(4500000)......
Input sentences: 4562102 Output sentences: 4500966
Traceback (most recent call last):
  File "pytorch/scripts/filter_dataset.py", line 79, in <module>
    main()
  File "pytorch/scripts/filter_dataset.py", line 55, in main
    for idx, lines in enumerate(zip(f1, f2)):
  File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6727: ordinal not in range(128)

Are these variables set in your environment?

export LANG=C.UTF-8 
export LC_ALL=C.UTF-8
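
Note that the perl warnings in the log suggest `C.UTF-8` is set but not installed on that system, so the process falls back to the plain `C` locale, where Python 3.6 typically defaults to ASCII for text files. A possible workaround that does not depend on the locale is to pass the encoding explicitly when opening the files. The sketch below is only an illustration: it assumes `filter_dataset.py` opens its two inputs without an `encoding` argument (the source is not shown here), and `read_pairs` and the sample file names are hypothetical.

```python
import locale
import os
import tempfile


def read_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (idx, (src_line, tgt_line)) using an explicit encoding.

    Hypothetical helper: it mirrors the `enumerate(zip(f1, f2))` loop from
    the traceback, but does not rely on the locale's default encoding.
    """
    with open(src_path, encoding=encoding) as f1, \
            open(tgt_path, encoding=encoding) as f2:
        for idx, lines in enumerate(zip(f1, f2)):
            yield idx, lines


# Build two tiny sample files (hypothetical names) containing non-ASCII text.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "sample.en")
tgt = os.path.join(tmpdir, "sample.de")
with open(src, "w", encoding="utf-8") as f:
    f.write("over\n")
with open(tgt, "w", encoding="utf-8") as f:
    f.write("über\n")

# Reproduce the failure mode: the UTF-8 encoding of "ü" starts with byte
# 0xc3, which the ASCII codec cannot decode.
try:
    "über\n".encode("utf-8").decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    failing_byte = exc.object[exc.start]

# With an explicit encoding the same content reads cleanly.
pairs = list(read_pairs(src, tgt))

# Shows which encoding open() would pick by default in this environment.
print("default text encoding:", locale.getpreferredencoding(False))
```

If the printed default encoding is `ANSI_X3.4-1968` (ASCII) rather than `UTF-8`, the exported variables are probably not taking effect; `locale -a` lists the locales actually installed, and `en_US.UTF-8` may be a usable substitute where `C.UTF-8` is missing.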