wdecoster/nanostat

Lengths of sequence and quality values differs

sgl07007 opened this issue · 3 comments

I am re-basecalling all of my MinION data with the new version of Guppy (v2.3.1), and I get the following error with one of my FASTQ files:

NanoStat --fastq JKH170_guppy_2.3.1.fastq
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/process.py", line 191, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/process.py", line 191, in <listcomp>
return [fn(*args) for args in chunk]
File "/usr/local/lib/python3.7/site-packages/nanoget/nanoget.py", line 424, in process_fastq_plain
data=[res for res in extract_from_fastq(inputfastq) if res],
File "/usr/local/lib/python3.7/site-packages/nanoget/nanoget.py", line 424, in <listcomp>
data=[res for res in extract_from_fastq(inputfastq) if res],
File "/usr/local/lib/python3.7/site-packages/nanoget/nanoget.py", line 434, in extract_from_fastq
for rec in SeqIO.parse(fq, "fastq"):
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 655, in parse
for r in i:
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 1029, in FastqPhredIterator
for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 950, in FastqGeneralIterator
% (title_line, seq_len, len(quality_string)))
ValueError: Lengths of sequence and quality values differs for 3da56ce1-f863-451d-9e5a-eaf696bdd766 runid=4dde5071c9a476b40b7acefbe968f5ed1392ca60 sampleid=Twelve_Genome_RUN_SLG_3_30_18 read=59475 ch=48 start_time=2018-03-31T14:09:47Z barcode=barcode06 (330 and 10951).
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/NanoStat", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/nanostat/NanoStat.py", line 74, in main
barcoded=args.barcoded)
File "/usr/local/lib/python3.7/site-packages/nanoget/nanoget.py", line 74, in get_input
dfs=[out for out in executor.map(extration_function, files)],
File "/usr/local/lib/python3.7/site-packages/nanoget/nanoget.py", line 74, in <listcomp>
dfs=[out for out in executor.map(extration_function, files)],
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/process.py", line 476, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
ValueError: Lengths of sequence and quality values differs for 3da56ce1-f863-451d-9e5a-eaf696bdd766 runid=4dde5071c9a476b40b7acefbe968f5ed1392ca60 sampleid=Twelve_Genome_RUN_SLG_3_30_18 read=59475 ch=48 start_time=2018-03-31T14:09:47Z barcode=barcode06 (330 and 10951)

I have gotten this error 10+ times with this FASTQ file. I have been removing the reads reported as "problems" (and also tried redoing the FASTQ concatenation, compression, and download), but I keep getting the same error with a different read specified. This was a MinION run with 12 barcodes, and the rest of the samples run just fine. Any suggestions for how to overcome this issue?

This error is raised by Biopython, the module which I use to parse FASTQ files. It means your FASTQ file is corrupted or malformed: for the reported read, the sequence is 330 characters long but the quality string is 10951. I'd suggest taking a look at that read in your FASTQ file with:

grep -C 10 3da56ce1-f863-451d-9e5a-eaf696bdd766 reads.fastq
or if using a compressed file:
zgrep -C 10 3da56ce1-f863-451d-9e5a-eaf696bdd766 reads.fastq.gz

With -C 10, grep also prints the 10 lines before and after each match. Inspecting the records should make clear that something is off, presumably introduced when Guppy concatenated the reads. You are not the only one with this issue.
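If the error keeps moving to a different read, it may be quicker to list every suspicious record in one pass than to grep for them one at a time. The following is only a rough sketch, not part of NanoStat or nanoget: the function names are made up for illustration, it uses the Python standard library only, and it assumes each record takes exactly four lines (title, sequence, "+", quality), which is worth confirming for your own files.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_length_mismatches(path):
    """Yield (line_number, read_id, seq_len, qual_len) for each bad record.

    Assumption: four lines per record (title, sequence, '+', quality).
    """
    with open_fastq(path) as handle:
        # Reading the same handle four times per step groups the file
        # into four-line records; a short final record is padded with "".
        records = zip_longest(*[handle] * 4, fillvalue="")
        for index, (title, seq, _plus, qual) in enumerate(records):
            seq, qual = seq.strip(), qual.strip()
            if len(seq) != len(qual):
                parts = title[1:].split()
                read_id = parts[0] if parts else ""
                # 1-based line number of the record's title line
                yield 4 * index + 1, read_id, len(seq), len(qual)


if __name__ == "__main__":
    import sys

    for line_no, read_id, seq_len, qual_len in find_length_mismatches(sys.argv[1]):
        print(f"line {line_no}: {read_id} sequence={seq_len} quality={qual_len}")
```

Once a record has lost or gained a line, every record after it is read out of frame and will probably be reported as well, so the first hit is the one to inspect.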

Thank you! Turns out an entire guppy fastq (default 4000 reads per fastq) was corrupted.
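For anyone else hitting this: since the damage was confined to one of Guppy's per-batch FASTQ files, checking each file before concatenating can point to the batch that needs to be redone. Below is a minimal sketch under a few assumptions: the function names and the "*.fastq" pattern are hypothetical, the files are uncompressed and small enough to read into memory, and each record is four lines long.

```python
from pathlib import Path


def first_problem(path):
    """Return a description of the first malformed record, or None if clean.

    Assumption: unwrapped FASTQ (title, sequence, '+', quality on four lines).
    """
    with open(path, "r") as handle:
        lines = [line.rstrip("\n") for line in handle]
    for start in range(0, len(lines), 4):
        record = lines[start:start + 4]
        if len(record) < 4:
            return f"line {start + 1}: truncated record"
        title, seq, plus, qual = record
        if not title.startswith("@") or not plus.startswith("+"):
            return f"line {start + 1}: unexpected record layout"
        if len(seq) != len(qual):
            return f"line {start + 1}: sequence {len(seq)}, quality {len(qual)}"
    return None


def corrupted_files(directory, pattern="*.fastq"):
    """Map each malformed FASTQ file in `directory` to its first problem."""
    report = {}
    for path in sorted(Path(directory).glob(pattern)):
        problem = first_problem(path)
        if problem is not None:
            report[path.name] = problem
    return report
```

Calling corrupted_files() on the Guppy output directory returns an empty dict when every file passes these checks; any file it lists is a candidate for exclusion or re-basecalling.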

I expected as much. The best solution is probably to repeat the basecalling...

Let me know if you encounter other issues.