broadinstitute/Drop-seq

ReduceGTF Fails for RefSeq GTF

Closed this issue · 4 comments

Hello,

Recently, our lab tried to create a drop-seq-tools-compatible genome from an old RefSeq version of the Macaca fascicularis GFF. We converted the GFF to GTF using gffread 0.11.6 (latest version, command). Then, on Terra, we launched the Cumulus Team’s dropseq_bundle WDL script which uses drop-seq_tools 2.3.0, and by association, what I think is Picard 2.18.14. The workflow fails on the Reduce_GTF step with the following error message:

…
INFO 2020-01-09 14:48:49 EnhanceGTFRecords enhancing 5,700 genes. Elapsed time: 00:00:18s. Time for last 100: 0s. Last read position: LOC102124659:0
INFO 2020-01-09 14:48:49 EnhanceGTFRecords enhancing 5,800 genes. Elapsed time: 00:00:18s. Time for last 100: 0s. Last read position: CDK13:0
[Thu Jan 09 14:48:49 UTC 2020] org.broadinstitute.dropseqrna.annotation.ReduceGtf done. Elapsed time: 0.32 minutes.
Runtime.totalMemory()=693784576
Exception in thread "main" java.lang.IllegalArgumentException: Illegal Capacity: -1
at java.util.ArrayList.<init>(ArrayList.java:157)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.getIntronIntervals(EnhanceGTFRecords.java:162)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.enhanceTranscript(EnhanceGTFRecords.java:124)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.enhanceGene(EnhanceGTFRecords.java:75)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.enhanceGTFRecords(EnhanceGTFRecords.java:61)
at org.broadinstitute.dropseqrna.annotation.ReduceGtf.doWork(ReduceGtf.java:115)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
…

We have also tried the newest RefSeq GTF, with which we somehow had even less success as crashed during the ConvertToRefFlat step as well. Both times we have tried the workflow, the star_index workflow task was perpetually stuck after it reported it had finished successfully. We will consult the workflow developer on that issue, though.

As for the main concern, Reduce_Gtf, it seems like it’s a Drop-Seq tools script that extends Picard, but I am not sure whether the script or Picard is at fault. As a result, I am posting this ticket on broadinstitute
/Drop-seq and the GATK Community Forum. I am attaching the log below, as well as the first 20 lines of the GTF. Help or advice would be greatly appreciated!

head.GCF_000364345.1_Macaca_fascicularis_5.0_genomic.txt
reduce_gtf.log

Thanks,
James
Shalek Lab @ MIT / Broad

alecw commented

Hi @jggatter ,

It looks like you have a gene with no exons. If you run again with VERBOSITY=DEBUG, you'll get a message for each gene being enhanced, and you can look at the last such message emitted before the exception in order to figure out which gene is the problem. You'd then need to fix the GTF somehow -- either add an exon or remove the gene entirely.

Regards, Alec

Hi @alecw !

Thanks for the quick reply! Sorry I couldn't get back to you sooner, we were slow to edit the GTF. We found about 13 transcripts without exons, so we changed each one's CDS to exon. Unfortunately, we are still getting the same error at a later gene.

...
INFO 2020-01-21 16:16:47 EnhanceGTFRecords enhancing 9,700 genes. Elapsed time: 00:00:21s. Time for last 100: 0s. Last read position: TUBA3E:0
INFO 2020-01-21 16:16:47 EnhanceGTFRecords enhancing 9,800 genes. Elapsed time: 00:00:21s. Time for last 100: 0s. Last read position: LOC102137047:0
[Tue Jan 21 16:16:47 UTC 2020] org.broadinstitute.dropseqrna.annotation.ReduceGtf done. Elapsed time: 0.37 minutes.
Runtime.totalMemory()=694054912
Exception in thread "main" java.lang.IllegalArgumentException: Illegal Capacity: -1
at java.util.ArrayList.<init>(ArrayList.java:157)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.getIntronIntervals(EnhanceGTFRecords.java:162)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.enhanceTranscript(EnhanceGTFRecords.java:124)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.enhanceGene(EnhanceGTFRecords.java:75)
at org.broadinstitute.dropseqrna.annotation.EnhanceGTFRecords.enhanceGTFRecords(EnhanceGTFRecords.java:61)
at org.broadinstitute.dropseqrna.annotation.ReduceGtf.doWork(ReduceGtf.java:115)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
2020/01/21 16:16:49 Starting delocalization.
...

I'm rechecking for this same problem, it's possible I missed some. Please keep this issue open in the meantime!

And also thanks, if I can't find anything this time I think I'll take your advice about cloning the repo and running the script locally with debug verbosity enabled.

I just ran a script I made to check the whole GTF for this problem. It revealed no exonless transcripts so I am 100% confident that every transcript has an exon.

VERBOSITY=DEBUG however revealed that the program was crashing on exons of mitochondrial genes at the end of the file that lack a gene_name in the last column. We copied the last column for each of the transcripts and pasted it as the exons'. It finished successfully! Thanks for all your help! I'll close this provided that we get through the rest of the process okay!