drivenbyentropy/aptasuite

AptaSuite parsing crash - invalid alphabet - help wanted

User-89 opened this issue · 1 comment

Hi,
I have been using aptasuite-0.9.4 for over a year now, importing multiplexed or demultiplexed NGS sequencing data as fastq files. Recently, however, I have run into a problem during parsing, which manifests itself as follows:

  1. While importing, all statistics for the uploaded fastq file (total processed reads, accepted reads, invalid alphabet, (...), invalid cycle) are displayed continuously and look OK.
  2. Then, all of a sudden, the count for "invalid alphabet" jumps from around 100-1000 to over 200 000, and up to 2 million!
  3. Right after that, the program crashes and displays the following message:

Reading configuration from file.
Instantiating MapDBAptamerPool
Processing selection cycle R5s2
Loading took 18743 milliseconds
Exception in thread "pool-2-thread-1" java.lang.NullPointerException
at lib.parser.aptaplex.FastqReader.getNextRead(FastqReader.java:128)
at lib.parser.aptaplex.AptaPlexProducer.run(AptaPlexProducer.java:180)
at java.lang.Thread.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

or

Reading configuration from file.
Instantiating MapDBAptamerPool
Processing selection cycle R4s
Loading took 19226 milliseconds
Exception in thread "pool-2-thread-1" java.lang.NullPointerException
at lib.parser.aptaplex.FastqReader.getNextRead(FastqReader.java:126)
at lib.parser.aptaplex.AptaPlexProducer.run(AptaPlexProducer.java:180)
at java.lang.Thread.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

It happened so far in 10 out of 13 demultiplexed files and in 2 out of 2 multiplexed files.
I checked these files (especially around the lines where AptaSuite encounters the problem) and they seem correct. To me it looks as if the AptaSuite parser "jumps over" a sequence line at some point and starts importing other lines (sequence identifier/quality score?) as sequence lines, hence the sudden surge in invalid alphabet counts (characters outside the ACTG range).

Please, let me know if you have any idea what could have happened and how to resolve this problem.
The raw data uploaded to AptaSuite are always prepared on the UseGalaxy platform using the same workflow. Some of the tools have been updated over time, but I do not think that would affect the correctness of the fastq data imported into AptaSuite. On the Galaxy platform all fastq files look correct; they have been checked with the FastQC tool as well as line by line around the problematic lines where AptaSuite crashes.

I have attached the logs for the two samples mentioned above, along with screenshots of what it looks like when the parser crashes:
log_2020-04-06_18-18-53.txt
log_2020-04-06_16-35-21.txt
R4s-error_png
Rs2-error_png

Thank you for the detailed report. I do agree that this could be related to lines being skipped in the files. However, it appears to be more related to the input data than to AptaSuite.

Could you please verify the validity of the files with a FastQ validator (a tool that checks each line for correctness)? FastQValidator comes to mind here.
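
If installing a validator is inconvenient, a rough line-by-line check along the following lines might already locate the first offending record. This is only a sketch under stated assumptions: it assumes every record occupies exactly four lines (no wrapped sequences), treats A, C, G, T and N as the only valid bases, and the function name `find_fastq_problems` is made up for illustration. It is neither FastQValidator nor the parser AptaSuite uses.

```python
import io

# Assumption: only these characters count as valid bases.
VALID_BASES = set("ACGTN")


def find_fastq_problems(handle, max_problems=20):
    """Return (line_number, message) pairs for structural problems in a FASTQ stream.

    Assumes strict 4-line records: header, sequence, separator, quality.
    """
    problems = []
    line_no = 0
    while len(problems) < max_problems:
        record = [handle.readline() for _ in range(4)]
        if record[0] == "":
            break  # clean end of file
        start = line_no + 1  # 1-based line number of the header line
        line_no += 4
        if any(line == "" for line in record):
            # readline() returns "" only at end of file, so the record is incomplete
            problems.append((start, "truncated record: fewer than 4 lines"))
            break
        header, seq, plus, qual = (line.rstrip("\r\n") for line in record)
        if not header.startswith("@"):
            problems.append((start, "header does not start with '@'"))
        if not seq or set(seq.upper()) - VALID_BASES:
            problems.append((start + 1, "sequence is empty or has characters outside ACGTN"))
        if not plus.startswith("+"):
            problems.append((start + 2, "separator does not start with '+'"))
        if len(qual) != len(seq):
            problems.append((start + 3, "quality length differs from sequence length"))
    return problems
```

To run it on a real file, open the file in text mode and pass the handle, for example `find_fastq_problems(open("sample.fastq"))`, where `sample.fastq` is a placeholder name. Once a single line is missing or duplicated, every later record is reported as well, so the first entry in the returned list is the one worth inspecting.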

Thank you and please let me know what you find.