mikessh/mageri

Error while running with UMI

VNagesh-Bio opened this issue · 6 comments

Hey Mike,

I am getting the following exception when running MAGERI:

[Fri Aug 11 12:22:47 EDT 2017 +00m00s] [my_project] Started analysis.
[Fri Aug 11 12:22:47 EDT 2017 +00m00s] [my_project] Pre-processing sample group my_sample.
[Fri Aug 11 12:22:47 EDT 2017 +00m00s] [Indexer] Building UMI index, 0 reads processed, 0.0% extracted..
Exception in thread "main" java.lang.RuntimeException: Error while parsing quality
at com.milaboratory.core.sequencing.io.fastq.SFastqReader.parse(SFastqReader.java:258)
at com.milaboratory.core.sequencing.io.fastq.PFastqReader.take(PFastqReader.java:213)
at com.antigenomics.mageri.core.input.PMigReader$PairedReaderWrapper.take(PMigReader.java:182)
at com.antigenomics.mageri.core.input.PMigReader$PairedReaderWrapper.take(PMigReader.java:166)
at cc.redberry.pipe.blocks.O2ITransmitter.run(O2ITransmitter.java:66)
at java.lang.Thread.run(Thread.java:744)
Caused by: com.milaboratory.core.sequence.quality.WrongQualityStringException: [-1]
at com.milaboratory.core.sequence.quality.SequenceQualityPhred.parse(SequenceQualityPhred.java:130)
at com.milaboratory.core.sequencing.io.fastq.SFastqReader.parse(SFastqReader.java:256)
... 5 more

I am running this on paired end reads.

My Read 1 looks like this:
@NB501788:53:H3HYMBGX3:1:11101:14803:1046 1:N:0:TACAGGTC+NATGCTGG UMI:GATCAC:CCCFFD
CTCCTNAGATACTGTTATCGTGCAGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNTTAAAGAAATATGCA
+
AAAAA#EEEEEEEEEEEEEEEEEAEEEE##########################AEEEEEEEEEEEEEE
@NB501788:53:H3HYMBGX3:1:11101:11212:1046 1:N:0:TACAGGTC+NATGCTGG UMI:TGGAAT:CCCFFD
GCCGTNATGCAGTAGCAGCGAGGCATTCNNNNNNNNNNNNNNNNNNNNNNNNNNGGCTACTTCTTATACT
+
AAA/A#EEE/EEEEEEEE/EEEEEEA/E##########################EEEEEEEAAEEEEEEE

And Read 2 looks similar but with 2:N:0:....

Any help in this matter will be greatly appreciated.

Thanks,
Vaishnavi

Hello Vaishnavi,

What quality format (and what sequencing platform) are you using? Have you tried converting your quality strings to the default Illumina Phred format?

Hi Mike,
Using Illumina. I noticed there were additional spaces at the end of the lines that were causing the issue. I am able to run it through MAGERI now.
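For anyone hitting the same `WrongQualityStringException`, here is a minimal sketch of how trailing whitespace could be stripped from a gzipped FASTQ file before running MAGERI. The function names and file paths are hypothetical and not part of MAGERI; the sketch relies on the fact that a space is not a valid Phred+33 quality character, so removing it from line ends should be safe.

```python
import gzip


def strip_trailing_whitespace(lines):
    """Yield each line with trailing spaces, tabs and CR/LF removed."""
    for line in lines:
        # rstrip() only touches the end of the line, so spaces inside
        # the header (e.g. before "1:N:0:...") are preserved.
        yield line.rstrip() + "\n"


def clean_fastq(src_path, dst_path):
    """Rewrite a gzipped FASTQ file without trailing whitespace.

    Both paths are placeholders chosen by the caller.
    """
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        dst.writelines(strip_trailing_whitespace(src))
```

A call such as `clean_fastq("Read1.fastq.gz", "Read1.clean.fastq.gz")` would be run once per read file, and the cleaned files passed to `-R1` and `-R2`.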

A couple of questions I had:

  1. Does MAGERI index the genome every time? Can I provide a .fai file instead of the FASTA?
  2. Also, when I run it through MAGERI, my assembled files seem to have no entries:

[Fri Aug 11 15:14:12 EDT 2017 +07m55s] [Indexer] Finished building UMI index, 24638979 reads processed, 100.0% extracted
[Fri Aug 11 15:14:12 EDT 2017 +07m55s] [my_project] Running analysis for sample group my_sample.
[Fri Aug 11 15:14:12 EDT 2017 +07m55s] [my_project.my_sample] Assembling & aligning consensuses, 0 MIGs processed..
[Fri Aug 11 15:14:30 EDT 2017 +08m14s] [my_project.my_sample] Assembling & aligning consensuses, 265 MIGs processed..
[Fri Aug 11 15:14:41 EDT 2017 +08m24s] [my_project.my_sample] Assembling & aligning consensuses, 1369 MIGs processed..
[Fri Aug 11 15:14:51 EDT 2017 +08m34s] [my_project.my_sample] Assembling & aligning consensuses, 3461 MIGs processed..
[Fri Aug 11 15:14:53 EDT 2017 +08m37s] [my_project.my_sample] Finished, 4095 MIGs processed in total.
[Fri Aug 11 15:14:53 EDT 2017 +08m37s] [my_project.my_sample] Calling variants.
[Fri Aug 11 15:14:53 EDT 2017 +08m37s] [my_project.my_sample] Finished
[Fri Aug 11 15:14:53 EDT 2017 +08m37s] [my_project] Done.
[Fri Aug 11 15:14:53 EDT 2017 +08m37s] [my_project] Writing output.
[Fri Aug 11 15:14:53 EDT 2017 +08m37s] [my_project] Done.

-rw-rw-r--. 1 vnagesh vnagesh 20 Aug 11 15:14 my_project.my_sample.assemble.R1.fastq.gz
-rw-rw-r--. 1 vnagesh vnagesh 20 Aug 11 15:14 my_project.my_sample.assemble.R2.fastq.gz

Not sure what is wrong here

Hello,

MAGERI does not index the genome. Instead, a set of reference FASTA files should be provided in a specific format (see docs).

It's quite strange that the assembled consensus files are empty while the pipeline says that 4095 MIGs were processed -- can you please provide the MAGERI logs/reports?

Hi @mikessh ,

Please find the attached log.
Not sure what other report I can help you with.
The command I used to run MAGERI:
java -Xmx64G -jar ~/Downloads/mageri/mageri.jar -M4 --references refs.fa -R1 Read1.fastq.gz -R2 Read2.fastq.gz > mageri.log
Regards,
Vaishnavi
mageri_log.txt
my_project.assemble.txt

Hi @mikessh
Did you get a chance to look into this issue?

Hi, yes, but forgot to answer here :)

So what the assemble log is telling us is that all reads were dropped because of errors (reads.mismatch.r1/reads.mismatch.r1). MAGERI assembles consensus sequences iteratively by looking for a "core" region (say, the most frequent combination of the central 16 bases of the read, with a +/- 4 base offset allowed). All reads are then aligned to the core region. If a read contains too many mismatches (> 4 in 16 bases), it is dropped.
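To make the description above concrete, here is a toy sketch of the idea. It is not MAGERI's actual implementation: the constants are simply the example numbers quoted above, the function names are made up, and picking the core from the offset-0 window only is a simplifying assumption.

```python
from collections import Counter

CORE_LEN = 16        # core size quoted above
MAX_OFFSET = 4       # +/- offset quoted above
MAX_MISMATCHES = 4   # reads with more mismatches than this are dropped


def central_window(read, offset=0):
    """Return the CORE_LEN-base window around the read centre, or None."""
    start = (len(read) - CORE_LEN) // 2 + offset
    if start < 0 or start + CORE_LEN > len(read):
        return None
    return read[start:start + CORE_LEN]


def find_core(reads):
    """Pick the most frequent central window among the reads (assumption)."""
    counts = Counter()
    for read in reads:
        window = central_window(read)
        if window is not None:
            counts[window] += 1
    return counts.most_common(1)[0][0] if counts else None


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_reads(reads):
    """Split reads into (kept, dropped) by their best match to the core."""
    core = find_core(reads)
    kept, dropped = [], []
    for read in reads:
        best = CORE_LEN  # worst case: every base differs
        if core is not None:
            for offset in range(-MAX_OFFSET, MAX_OFFSET + 1):
                window = central_window(read, offset)
                if window is not None:
                    best = min(best, mismatches(core, window))
        (kept if best <= MAX_MISMATCHES else dropped).append(read)
    return core, kept, dropped
```

If unrelated reads are grouped under one UMI, most of them will not match whatever core wins the vote, which is consistent with every read ending up in the dropped list.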

I suspect there is something wrong with the way you define UMI positions, as there are on average 6000 reads per UMI, which is a lot (typical coverage should be 30-100 reads). If you have defined the wrong region for UMI extraction, unrelated reads can end up in the same UMI group (aka MIG) and will not match the "core" sequence. If you manually select a UMI sequence and then look up the reads carrying it in your FASTQ file, perhaps you can find out where the problem is.
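A rough sketch of such a lookup is shown here. It assumes the UMI is stored in the read header as `UMI:<sequence>:<quality>`, as in the example reads posted above; the regular expression and the function names are illustrative and not part of MAGERI.

```python
import gzip
import re
from collections import Counter

# Assumes headers of the form "... UMI:GATCAC:CCCFFD", as in the posted reads.
UMI_TAG = re.compile(r"UMI:([ACGTN]+):")


def iter_fastq(lines):
    """Yield (header, sequence, quality) tuples from FASTQ lines."""
    it = iter(lines)
    # zip over the same iterator consumes four lines per record.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip(), seq.rstrip(), qual.rstrip()


def umi_of(header):
    """Extract the UMI sequence from a header, or None if it is missing."""
    match = UMI_TAG.search(header)
    return match.group(1) if match else None


def reads_per_umi(records):
    """Count how many reads carry each UMI."""
    return Counter(umi_of(header) for header, _, _ in records)


def reads_for_umi(records, umi):
    """Return the sequences of all reads tagged with the given UMI."""
    return [seq for header, seq, _ in records if umi_of(header) == umi]


def load_records(path):
    """Read all records from a gzipped FASTQ file (path is a placeholder)."""
    with gzip.open(path, "rt") as handle:
        return list(iter_fastq(handle))
```

With `records = load_records("Read1.fastq.gz")`, calling `reads_per_umi(records).most_common(10)` lists the largest UMI groups, and `reads_for_umi(records, umi)` returns the sequences for one of them so they can be compared by eye. If the sequences look unrelated, the UMI region is probably defined incorrectly.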

Alternatively, it is possible to change the assembler parameters (increase the "core" region size and the number of allowed errors) by modifying the MAGERI parameters in the XML config (see readme).

I can't tell much more without looking into the raw data / experiment design.