NullPointerException when running TagReadWithGeneFunction, Drop-seq protocol
hoytpr opened this issue · 9 comments
I'm running Drop-seq_tools2.3.0 or our University cluster and have some scRNAseq data from our Illumina instrument. Drop-seq was installed in the last few days. Following the McCarroll lab protocol from 2018.
I have completed the create_Drop-seq_reference_metadata.sh
script, and am running the Drop-seq_alignment.sh
script on a small batch of sequence data (one sample with only a few thousand reads). These sequences were processed with bcl2fastq
(most recent version) and then a new install of Picard 2.21.4-3gd7e09b8 (installed by cloning and building with gradle) using FastqToSam
to generate the uBAM files. Unfortunately the NullPointerException
doesn't give me much to troubleshoot. I have read other issues on this site, but none seemed to solve this exact error.
EDIT: I did not include that this cluster loads JAVA 1.8 using a module so module load jdk/1.8.0_112
must be part of my submission scripts.
TagReadWithInterval completes then during the TagReadWithGeneFunction step of the process, the following error occurs:
[Tue Dec 17 15:04:36 CST 2019] org.broadinstitute.dropseqrna.metrics.TagReadWithInterval done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=504889344
INFO 2019-12-17 15:04:37 TagReadWithGeneFunction
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** TagReadWithGeneFunction -O ./dropseqtemp/function_tagged.bam -ANNOTATIONS_FILE ../outputs/cat9.refFlat -INPUT ./dropseqtemp/gene_tagged.bam
**********
15:04:37.696 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
15:04:37.708 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
15:04:37.710 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
15:04:37.710 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Tue Dec 17 15:04:37 CST 2019] TagReadWithGeneFunction INPUT=./dropseqtemp/gene_tagged.bam OUTPUT=./dropseqtemp/function_tagged.bam ANNOTATIONS_FILE=../outputs/cat9.refFlat GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Dec 17 15:04:37 CST 2019] Executing as phoyt@n252.cluster on Linux 2.6.32-279.5.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
15:04:37.735 WARN IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
[Tue Dec 17 15:04:38 CST 2019] org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=504889344
Exception in thread "main" java.lang.NullPointerException
at picard.annotation.RefFlatReader.makeTranscriptFromRefFlatLine(RefFlatReader.java:161)
at picard.annotation.RefFlatReader.makeGeneFromRefFlatLines(RefFlatReader.java:147)
at picard.annotation.RefFlatReader.load(RefFlatReader.java:104)
at picard.annotation.RefFlatReader.load(RefFlatReader.java:66)
at picard.annotation.GeneAnnotationReader.loadRefFlat(GeneAnnotationReader.java:37)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadRefFlat(GeneAnnotationReader.java:72)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:53)
at org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction.doWork(TagReadWithGeneFunction.java:113)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
Steps to reproduce
I do not have additional files to reproduce at this time, but will gladly run larger files if it will help. While inexperienced in this process, the GTF
and reduced.GTF
files look correctly formatted however I'm not very familiar with the .refFlat
format:
(top of) GTF FILE:
NC_001700.1 RefSeq CDS 3615 4570 . + 0 transcript_id "gene35575"; transcript_name "gene35575"; gene_id "gene35575"; gene_name "ND1";
NC_001700.1 RefSeq CDS 4781 5822 . + 0 transcript_id "gene35576"; transcript_name "gene35576"; gene_id "gene35576"; gene_name "ND2";
NC_001700.1 RefSeq CDS 6216 7760 . + 0 transcript_id "gene35577"; transcript_name "gene35577"; gene_id "gene35577"; gene_name "COX1";
(top of) reduced.GTF file:
chr start end strand gene_name gene_id transcript_name transcript_id transcriptType annotationType
NC_018723.3 79442 79757 + LOC102899061 gene0 rna0 rna0 NA exon
NC_018723.3 79442 84501 + LOC102899061 gene0 NA NA NA gene
NC_018723.3 79442 84501 + LOC102899061 gene0 rna0 rna0 NA transcript
(top of) refFlat file:
LOC109496728 rna5513 NC_018724.3 + 6794351 6798252 6794351 6798252 2 6794351,6797801, 6794446,6798252,
NPPB rna32694 NC_018730.3 - 8677085 8678561 8677263 8678472 3 8677085,8677854,8678346, 8677280,8678110,8678561,
PTF1A rna28013 NC_018729.3 + 21784047 21785957 21784222 21785608 4 21784047,21784664,21784917,21785405, 21784663,21784916,21785011,21785957,
LOC109495981 rna68304 NC_018740.3 + 10478959 10484355 10478959 10484355 3 10478959,10483664,10484231, 10479105,10483862,10484355,
Any advice would be greatly appreciated.
Hi Pete,
Is your refFlat file tab-separated? It doesn't look that way, but that could be github reformatting. One thing you can do to narrow down the problem is to run ConvertToRefFlat on your refFlat file. I know that is counter-intuitive, but if input is refFlat, it uses the same code to read the file. You can then split the refFlat in half, and run ConvertToRefFlat separately on each half to see which has the problem. Keep doing that until you've got a smallish refFlat that causes the problem, and then attach that file, and also your sequence dictionary, and then I can debug.
Regards, Alec
Hi Alec and thanks. The refFlat file is tab-delimited. I'm working on the ConvertToRefFlat splitting now.
I went through my history and noticed the first run through ConvertToRefFlat I had no problems with the GTF parser, but the GTFReader gave multiple "INFO" outputs of "Chromosome Disagreement" that appeared to be almost all tRNAs (607 lines total) which were "skipped". These did not cause any errors or warnings later, so I had ignored them based on researching others who had similar issues. If this is helpful, I apologize for not mentioning it earlier.
[Mon Dec 16 17:56:00 CST 2019] FilterGtf GTF=../picard/cat9.gtf OUTPUT=../outputs/cat9.gtf SEQUENCE_DICTIONARY=../outputs/cat9.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Dec 16 17:56:00 CST 2019] Executing as phoyt@gpu01.cluster on Linux 2.6.32-279.5.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
[Mon Dec 16 17:56:13 CST 2019] org.broadinstitute.dropseqrna.annotation.FilterGtf done. Elapsed time: 0.22 minutes.
Runtime.totalMemory()=2849505280
INFO 2019-12-16 17:56:14 ConvertToRefFlat
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** ConvertToRefFlat -ANNOTATIONS_FILE ../outputs/cat9.gtf -SEQUENCE_DICTIONARY ../outputs/cat9.dict -OUTPUT ../outputs/cat9.refFlat
**********
17:56:14.807 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
17:56:14.819 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
17:56:14.820 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
17:56:14.821 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Mon Dec 16 17:56:14 CST 2019] ConvertToRefFlat ANNOTATIONS_FILE=../outputs/cat9.gtf SEQUENCE_DICTIONARY=../outputs/cat9.dict OUTPUT=../outputs/cat9.refFlat VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Dec 16 17:56:14 CST 2019] Executing as phoyt@gpu01.cluster on Linux 2.6.32-279.5.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
INFO 2019-12-16 17:56:16 GTFParser read 100,000 GTF records. Elapsed time: 00:00:01s. Time for last 100,000: 1s. Last read position: NC_018724.3:3,809,593
INFO 2019-12-16 17:56:17 GTFParser read 200,000 GTF records. Elapsed time: 00:00:02s. Time for last 100,000: 0s. Last read position: NC_018724.3:146,949,425
<snip>
INFO 2019-12-16 17:56:29 GTFReader Chromosome disagreement(NC_018726.3, NC_018727.3, NC_018741.3, NC_018732.3, NC_018733.3) in GTF file for gene TRNAV-UAC -- skipping
INFO 2019-12-16 17:56:29 GTFReader Multiple gene IDs for gene CRX: gene29293, gene29533 -- skipping
INFO 2019-12-16 17:56:29 GTFReader Chromosome disagreement(NC_018727.3, NC_018728.3, NC_018739.3, NC_018734.3, NC_018736.3, NC_018730.3, NC_018732.3, NC_018740.3) in GTF file for gene TRNAK-UUU -- skipping
INFO 2019-12-16 17:56:30 GTFReader Chromosome disagreement(NC_018726.3, NC_018727.3, NC_018739.3, NC_018737.3, NW_019369290.1, NC_018732.3, NW_019369044.1, NC_018740.3) in GTF file for gene TRNAM-CAU -- skipping
INFO 2019-12-16 17:56:30 GTFReader Chromosome disagreement(NC_018727.3, NC_018723.3) in GTF file for gene TRNAL-CAA -- skipping
INFO 2019-12-16 17:56:30 GTFReader Chromosome disagreement(NC_018727.3, NC_018739.3, NC_018737.3, NW_019366542.1) in GTF file for gene TRNAL-CAG -- skipping
INFO 2019-12-16 17:56:30 GTFReader Chromosome disagreement(NC_018727.3, NC_018738.3, NC_018729.3, NW_019367629.1, NC_018736.3, NW_019369290.1) in GTF file for gene TRNAW-CCA -- skipping
Pete
Hi Alec,
I was able to narrow it down using your suggestion and at least the FIRST problem turns out to be between line 60 and 70 of my refFlat file. Yanking those lines out it looks pretty obvious which line is causing the error.
KIAA1211 rna17589 NC_018726.3 - 162541228 162823577 162543860 162569484 12 162541228,162552610,162553345,162560558,162564056,162566637,162569373,162603476,162684525,162744556,162822721,162823447, 162544021,162552773,162553536,162563379,162564239,162566706,162569538,162603648,162684667,162744641,162822813,162823577,
KIAA1211 rna17588 NC_018726.3 - 162541228 162823577 162543860 162577121 12 162541228,162552610,162553345,162560558,162564056,162566637,162569373,162577001,162603476,162684525,162822721,162823447, 162544021,162552773,162553536,162563379,162564239,162566706,162569538,162577137,162603648,162684667,162822813,162823577,
LOC111557364 rna50615 NC_018734.3 + 19102268 19105955 19102268 19105600 3 19102268,19102541,19105391, 19102283,19103058,19105955,
MAPKAPK3 rna6829 NC_018724.3 + 19506120 19533746 19525218 19532486 12 19506120,19523138,19525211,19526535,19527069,19528937,19529639,19530234,19530656,19531206,19531565,19532333, 19506229,19523228,19525351,19526600,19527149,19529061,19529769,19530310,19530781,19531292,19531646,19533746,
MAPKAPK3 rna6828 NC_018724.3 + 19505576 19533746 19506004 19532486 11 19505576,19505952,19525211,19526535,19527069,19528937,19530234,19530656,19531206,19531565,19532333, 19505651,19506229,19525351,19526600,19527149,19529061,19530310,19530781,19531292,19531646,19533746,
MAPKAPK3 rna6827 NC_018724.3 + 19505558 19533746 19506004 19532486 11 19505558,19505952,19525211,19526535,19527069,19528937,19530234,19530656,19531206,19531565,19532333, 19505681,19506229,19525351,19526600,19527149,19529061,19530310,19530781,19531292,19531646,19533746,
LOC111560961 gene12665 NC_018728.3 + 2147483646 -2147483648 75045677 75051353 0
LOC109503493 rna40565 NC_018731.3 + 72170232 72171300 72170232 72171300 2 72170232,72170610, 72170369,72171300,
LOC111558154 rna60143 NC_018737.3 + 7513632 7513730 7513632 7513730 1 7513632, 7513730,
LOC101088754 gene12138 NC_018728.3 + 49291526 49296291 49291526 49296291 1 49291526, 49296291,
LOC102902640 rna26581 NC_018728.3 - 124219427 124225245 124219427 124225245 3 124219427,124221630,124224160, 124220193,124221709,124225245,
The line LOC111560961 gene12665 NC_018728.3 + 2147483646 -2147483648 75045677 75051353 0
has a null value (a zero, followed by two tabs).
There are probably many more of these lines, but more importantly, how did they get there? When I check my original GTF file:
grep LOC111560961 cat9.gtf > LOC111560961.txt
The output is:
NC_018728.3 Gnomon CDS 75045678 75045729 . + 0 transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";
NC_018728.3 Gnomon CDS 75045904 75046316 . + 2 transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";
NC_018728.3 Gnomon CDS 75046835 75046875 . + 0 transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";
NC_018728.3 Gnomon CDS 75051137 75051353 . + 1 transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";
So the attributes are there, However the locus and genename are not in the reduced.gtf file
grep LOC111560961 cat9.reduced.gtf
grep gene12665 cat9.reduced.gtf
Gives empty files. Thanks for any advice. I really appreciate it.
Pete
This is the first 70 lines of my refFlat file
Hi Pete,
The problem is that there are no exons in this transcript. I think you need to remove any such transcript from your GTF and re-create the refFlat file. The signal is -2147483648 in the refFlat file.
Regards, Alec
Hi Alec,
I appreciate your help. I hope you don't mind but I am new to this analysis and am not sure how to determine if there are exons in the transcripts. Are you saying I should go through the refFlat file and parse out the lines with bad signals, and then use that information to remove those lines from the GTF file? I apologize for having to ask. I can likely wrangle any GTF file if I know exactly what I'm looking for.
Pete
Yes, that is what I'm saying. In GTF format, the definition of a transcript is spread across multiple lines, so it's hard to figure out which transcripts are lacking exons. ConvertToRefFlat, by accident, flags such transcripts with -2147483648. You can get the names of problematic transcripts by searching for that value, filter those transcripts out of GTF, and regenerate the refFlat.
I realize another alternative is simply to remove from your refFlat the lines containing -2147483648.
-Alec
Removing lines with -2147483648 worked for me.