wdecoster/nanostat

Error UnicodeDecode utf-8

stanislasmorand opened this issue · 7 comments

Dear Wouter De Coster,
I ran your tool on a fastq file generated by a MinION device + MinKNOW software + Guppy basecaller.
Unfortunately, I received the error message below.
Do you have an idea of what went wrong? I imagine that the quality line associated with a nucleotide sequence might contain inappropriate symbols that NanoStat does not recognize, but that seems unlikely since the quality symbols are well defined.
Looking forward to reading your feedback and suggestions.
Kindest regards,
Stan

[12:03] morands@frrdcim20: test_nanostat $ NanoStat --fastq ./0-50.fastq
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 198, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 198, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/extraction_functions.py", line 321, in process_fastq_plain
data=[res for res in extract_from_fastq(inputfastq) if res],
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/extraction_functions.py", line 321, in <listcomp>
data=[res for res in extract_from_fastq(inputfastq) if res],
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/extraction_functions.py", line 331, in extract_from_fastq
for rec in SeqIO.parse(fq, "fastq"):
File "/home/morands/LIBRARIES/biopython-1.76/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 1055, in FastqPhredIterator
for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
File "/home/morands/LIBRARIES/biopython-1.76/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 956, in FastqGeneralIterator
line = handle_readline()
File "/usr/local/python/3.7.4/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 819: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/morands/NANOSTAT/nanostat-1.4.0/bin/NanoStat", line 11, in <module>
load_entry_point('NanoStat==1.4.0', 'console_scripts', 'NanoStat')()
File "/home/morands/NANOSTAT/nanostat-1.4.0/lib/python3.7/site-packages/NanoStat-1.4.0-py3.7.egg/nanostat/NanoStat.py", line 85, in main
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/nanoget.py", line 92, in get_input
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/nanoget.py", line 92, in
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 483, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
yield fs.pop().result()
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 819: invalid start byte

Is the data by chance actually gzipped? What is the output of

file ./0-50.fastq
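For readers without the `file` utility at hand, the gzip check can be sketched in Python. This is a minimal sketch, not part of NanoStat or nanoget: the helper name `looks_gzipped` is hypothetical. It relies only on the fact that every gzip stream starts with the two bytes 0x1f 0x8b; the second of those is the byte named in the error above, which is one reason compression is a natural first suspect.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number.

    Hypothetical helper for illustration; it only inspects the first two
    bytes and does not validate the rest of the file.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    import sys

    for fastq_path in sys.argv[1:]:
        status = "gzip-compressed" if looks_gzipped(fastq_path) else "not gzip"
        print(f"{fastq_path}: {status}")
```

A file that fails to decode at a position well past the start, as in the traceback above (position 819), would likely report "not gzip" here, so a negative result does not rule out other stray bytes further into the file.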

Hmm, CRLF line terminators. Did you open or manipulate the file using a Windows program?

Please try dos2unix ./0-50.fastq
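If `dos2unix` is not installed, the same line-ending conversion can be approximated in Python. The sketch below is an assumption-laden stand-in, not the `dos2unix` implementation: the function name `crlf_to_lf` is hypothetical, and it writes a new file rather than converting in place. It works on raw bytes, so it never tries to decode the content.

```python
def crlf_to_lf(src_path, dst_path):
    """Copy src_path to dst_path, replacing CRLF line endings with LF.

    Hypothetical helper for illustration. Returns the number of lines
    whose ending was changed. All other bytes are copied untouched.
    """
    converted = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # binary iteration splits on b"\n"
            if line.endswith(b"\r\n"):
                line = line[:-2] + b"\n"
                converted += 1
            dst.write(line)
    return converted
```

Because only the `\r\n` pairs are rewritten, a stray byte such as 0x8b inside a record passes through unchanged, so this step alone would not be expected to clear a `UnicodeDecodeError`.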

Dear Wouter,
The error appeared again with a different position (168 versus 819).
Kindest regards,
Stan

[14:02] morands@frrdcim20: test_nanostat $ NanoStat --fastq ./0-50.fastq
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 198, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 198, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/extraction_functions.py", line 321, in process_fastq_plain
data=[res for res in extract_from_fastq(inputfastq) if res],
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/extraction_functions.py", line 321, in <listcomp>
data=[res for res in extract_from_fastq(inputfastq) if res],
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/extraction_functions.py", line 331, in extract_from_fastq
for rec in SeqIO.parse(fq, "fastq"):
File "/home/morands/LIBRARIES/biopython-1.76/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 1055, in FastqPhredIterator
for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
File "/home/morands/LIBRARIES/biopython-1.76/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 956, in FastqGeneralIterator
line = handle_readline()
File "/usr/local/python/3.7.4/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 168: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/morands/NANOSTAT/nanostat-1.4.0/bin/NanoStat", line 11, in <module>
load_entry_point('NanoStat==1.4.0', 'console_scripts', 'NanoStat')()
File "/home/morands/NANOSTAT/nanostat-1.4.0/lib/python3.7/site-packages/NanoStat-1.4.0-py3.7.egg/nanostat/NanoStat.py", line 85, in main
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/nanoget.py", line 92, in get_input
File "/home/morands/LIBRARIES/nanoget-1.15.0/lib/python3.7/site-packages/nanoget-1.15.0-py3.7.egg/nanoget/nanoget.py", line 92, in
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/process.py", line 483, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
yield fs.pop().result()
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/usr/local/python/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 168: invalid start byte

What exactly did you do to those files?

Dear Wouter,
My initial 186 fastq files were generated on a Windows computer (MinKNOW software + Guppy basecaller) and then transferred to a Linux server. I ran NanoStat on them one by one to identify which ones were compromised, and found 2 problematic fastq files. Following your suspicion of a DOS-to-Unix conversion issue, I converted those 2 fastq files with the command iconv -f iso-8859-1 -t utf8. For one file, the conversion solved the issue (it passed the NanoStat analysis). For the other file, NanoStat still raised an error, but the error description identified the suspicious read, so I was able to delete it, and the new fastq file passed the NanoStat test.
Thanks again for your suggestion about an issue related to the DOS and Unix formats.
Kindest regards,
Stan
unconventional_quality_symbols_fastq_file
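
The manual search described above (finding which read contains the offending byte) can be sketched in Python. This is a minimal sketch under stated assumptions, not NanoStat or nanoget code: the names `find_undecodable_lines` and `record_number` are hypothetical, and the record arithmetic assumes the common layout of exactly four lines per FASTQ record. The file is read as bytes and each line is decoded separately, so one bad line does not stop the scan.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Yield (line_number, offset_in_line, raw_line) for each line that
    cannot be decoded with `encoding`. Line numbers start at 1.

    Hypothetical helper for illustration.
    """
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # err.start is the index of the first undecodable byte
                yield line_number, err.start, raw


def record_number(line_number, lines_per_record=4):
    """Map a 1-based line number to a 1-based FASTQ record number,
    assuming every record spans exactly `lines_per_record` lines."""
    return (line_number - 1) // lines_per_record + 1


if __name__ == "__main__":
    import sys

    for fastq_path in sys.argv[1:]:
        for line_no, offset, raw in find_undecodable_lines(fastq_path):
            print(f"{fastq_path}: record {record_number(line_no)}, "
                  f"line {line_no}, byte offset {offset}: {raw[:60]!r}")
```

Passing `encoding="ascii"` makes the check stricter, which may suit FASTQ since standard quality characters are printable ASCII. Note that converting from ISO-8859-1 with `iconv` makes every byte decodable, so it can let a file pass while the unusual symbol is still present in the read; locating and inspecting the record is a way to confirm what was actually there.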