broadinstitute/Drop-seq

NullPointerException when running TagReadWithGeneFunction, Drop-seq protocol

hoytpr opened this issue · 9 comments

I'm running Drop-seq_tools2.3.0 or our University cluster and have some scRNAseq data from our Illumina instrument. Drop-seq was installed in the last few days. Following the McCarroll lab protocol from 2018.

I have completed the create_Drop-seq_reference_metadata.sh script, and am running the Drop-seq_alignment.sh script on a small batch of sequence data (one sample with only a few thousand reads). These sequences were processed with bcl2fastq (most recent version) and then a new install of Picard 2.21.4-3gd7e09b8 (installed by cloning and building with gradle) using FastqToSam to generate the uBAM files. Unfortunately the NullPointerException doesn't give me much to troubleshoot. I have read other issues on this site, but none seemed to solve this exact error.

EDIT: I did not include that this cluster loads JAVA 1.8 using a module so module load jdk/1.8.0_112 must be part of my submission scripts.

TagReadWithInterval completes then during the TagReadWithGeneFunction step of the process, the following error occurs:

[Tue Dec 17 15:04:36 CST 2019] org.broadinstitute.dropseqrna.metrics.TagReadWithInterval done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=504889344
INFO	2019-12-17 15:04:37	TagReadWithGeneFunction	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    TagReadWithGeneFunction -O ./dropseqtemp/function_tagged.bam -ANNOTATIONS_FILE ../outputs/cat9.refFlat -INPUT ./dropseqtemp/gene_tagged.bam
**********


15:04:37.696 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
15:04:37.708 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
15:04:37.710 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
15:04:37.710 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Tue Dec 17 15:04:37 CST 2019] TagReadWithGeneFunction INPUT=./dropseqtemp/gene_tagged.bam OUTPUT=./dropseqtemp/function_tagged.bam ANNOTATIONS_FILE=../outputs/cat9.refFlat    GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Dec 17 15:04:37 CST 2019] Executing as phoyt@n252.cluster on Linux 2.6.32-279.5.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
15:04:37.735 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
[Tue Dec 17 15:04:38 CST 2019] org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=504889344
Exception in thread "main" java.lang.NullPointerException
	at picard.annotation.RefFlatReader.makeTranscriptFromRefFlatLine(RefFlatReader.java:161)
	at picard.annotation.RefFlatReader.makeGeneFromRefFlatLines(RefFlatReader.java:147)
	at picard.annotation.RefFlatReader.load(RefFlatReader.java:104)
	at picard.annotation.RefFlatReader.load(RefFlatReader.java:66)
	at picard.annotation.GeneAnnotationReader.loadRefFlat(GeneAnnotationReader.java:37)
	at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadRefFlat(GeneAnnotationReader.java:72)
	at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:53)
	at org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction.doWork(TagReadWithGeneFunction.java:113)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
	at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

Steps to reproduce

I do not have additional files to reproduce at this time, but will gladly run larger files if it will help. While inexperienced in this process, the GTF and reduced.GTF files look correctly formatted however I'm not very familiar with the .refFlat format:

(top of) GTF FILE:

NC_001700.1     RefSeq  CDS     3615    4570    .       +       0       transcript_id "gene35575"; transcript_name "gene35575"; gene_id "gene35575"; gene_name "ND1";
NC_001700.1     RefSeq  CDS     4781    5822    .       +       0       transcript_id "gene35576"; transcript_name "gene35576"; gene_id "gene35576"; gene_name "ND2";
NC_001700.1     RefSeq  CDS     6216    7760    .       +       0       transcript_id "gene35577"; transcript_name "gene35577"; gene_id "gene35577"; gene_name "COX1";

(top of) reduced.GTF file:

chr     start   end     strand  gene_name       gene_id transcript_name transcript_id   transcriptType  annotationType
NC_018723.3     79442   79757   +       LOC102899061    gene0   rna0    rna0    NA      exon
NC_018723.3     79442   84501   +       LOC102899061    gene0   NA      NA      NA      gene
NC_018723.3     79442   84501   +       LOC102899061    gene0   rna0    rna0    NA      transcript

(top of) refFlat file:

LOC109496728    rna5513 NC_018724.3     +       6794351 6798252 6794351 6798252 2       6794351,6797801,        6794446,6798252,
NPPB    rna32694        NC_018730.3     -       8677085 8678561 8677263 8678472 3       8677085,8677854,8678346,        8677280,8678110,8678561,
PTF1A   rna28013        NC_018729.3     +       21784047        21785957        21784222        21785608        4       21784047,21784664,21784917,21785405,        21784663,21784916,21785011,21785957,
LOC109495981    rna68304        NC_018740.3     +       10478959        10484355        10478959        10484355        3       10478959,10483664,10484231, 10479105,10483862,10484355,

Any advice would be greatly appreciated.

alecw commented

Hi Pete,

Is your refFlat file tab-separated? It doesn't look that way, but that could be github reformatting. One thing you can do to narrow down the problem is to run ConvertToRefFlat on your refFlat file. I know that is counter-intuitive, but if input is refFlat, it uses the same code to read the file. You can then split the refFlat in half, and run ConvertToRefFlat separately on each half to see which has the problem. Keep doing that until you've got a smallish refFlat that causes the problem, and then attach that file, and also your sequence dictionary, and then I can debug.

Regards, Alec

Hi Alec and thanks. The refFlat file is tab-delimited. I'm working on the ConvertToRefFlat splitting now.

I went through my history and noticed the first run through ConvertToRefFlat I had no problems with the GTF parser, but the GTFReader gave multiple "INFO" outputs of "Chromosome Disagreement" that appeared to be almost all tRNAs (607 lines total) which were "skipped". These did not cause any errors or warnings later, so I had ignored them based on researching others who had similar issues. If this is helpful, I apologize for not mentioning it earlier.

[Mon Dec 16 17:56:00 CST 2019] FilterGtf GTF=../picard/cat9.gtf OUTPUT=../outputs/cat9.gtf SEQUENCE_DICTIONARY=../outputs/cat9.dict    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Dec 16 17:56:00 CST 2019] Executing as phoyt@gpu01.cluster on Linux 2.6.32-279.5.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
[Mon Dec 16 17:56:13 CST 2019] org.broadinstitute.dropseqrna.annotation.FilterGtf done. Elapsed time: 0.22 minutes.
Runtime.totalMemory()=2849505280
INFO	2019-12-16 17:56:14	ConvertToRefFlat	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    ConvertToRefFlat -ANNOTATIONS_FILE ../outputs/cat9.gtf -SEQUENCE_DICTIONARY ../outputs/cat9.dict -OUTPUT ../outputs/cat9.refFlat
**********


17:56:14.807 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
17:56:14.819 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
17:56:14.820 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/panfs.cluster/scratch/phoyt/scRNA/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
17:56:14.821 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Mon Dec 16 17:56:14 CST 2019] ConvertToRefFlat ANNOTATIONS_FILE=../outputs/cat9.gtf SEQUENCE_DICTIONARY=../outputs/cat9.dict OUTPUT=../outputs/cat9.refFlat    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Dec 16 17:56:14 CST 2019] Executing as phoyt@gpu01.cluster on Linux 2.6.32-279.5.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
INFO	2019-12-16 17:56:16	GTFParser	read       100,000 GTF records.  Elapsed time: 00:00:01s.  Time for last 100,000:    1s.  Last read position: NC_018724.3:3,809,593
INFO	2019-12-16 17:56:17	GTFParser	read       200,000 GTF records.  Elapsed time: 00:00:02s.  Time for last 100,000:    0s.  Last read position: NC_018724.3:146,949,425

<snip>

INFO	2019-12-16 17:56:29	GTFReader	Chromosome disagreement(NC_018726.3, NC_018727.3, NC_018741.3, NC_018732.3, NC_018733.3) in GTF file for gene TRNAV-UAC -- skipping
INFO	2019-12-16 17:56:29	GTFReader	Multiple gene IDs for gene CRX: gene29293, gene29533 -- skipping
INFO	2019-12-16 17:56:29	GTFReader	Chromosome disagreement(NC_018727.3, NC_018728.3, NC_018739.3, NC_018734.3, NC_018736.3, NC_018730.3, NC_018732.3, NC_018740.3) in GTF file for gene TRNAK-UUU -- skipping
INFO	2019-12-16 17:56:30	GTFReader	Chromosome disagreement(NC_018726.3, NC_018727.3, NC_018739.3, NC_018737.3, NW_019369290.1, NC_018732.3, NW_019369044.1, NC_018740.3) in GTF file for gene TRNAM-CAU -- skipping
INFO	2019-12-16 17:56:30	GTFReader	Chromosome disagreement(NC_018727.3, NC_018723.3) in GTF file for gene TRNAL-CAA -- skipping
INFO	2019-12-16 17:56:30	GTFReader	Chromosome disagreement(NC_018727.3, NC_018739.3, NC_018737.3, NW_019366542.1) in GTF file for gene TRNAL-CAG -- skipping
INFO	2019-12-16 17:56:30	GTFReader	Chromosome disagreement(NC_018727.3, NC_018738.3, NC_018729.3, NW_019367629.1, NC_018736.3, NW_019369290.1) in GTF file for gene TRNAW-CCA -- skipping

Pete

Hi Alec,
I was able to narrow it down using your suggestion and at least the FIRST problem turns out to be between line 60 and 70 of my refFlat file. Yanking those lines out it looks pretty obvious which line is causing the error.

KIAA1211	rna17589	NC_018726.3	-	162541228	162823577	162543860	162569484	12	162541228,162552610,162553345,162560558,162564056,162566637,162569373,162603476,162684525,162744556,162822721,162823447,	162544021,162552773,162553536,162563379,162564239,162566706,162569538,162603648,162684667,162744641,162822813,162823577,
KIAA1211	rna17588	NC_018726.3	-	162541228	162823577	162543860	162577121	12	162541228,162552610,162553345,162560558,162564056,162566637,162569373,162577001,162603476,162684525,162822721,162823447,	162544021,162552773,162553536,162563379,162564239,162566706,162569538,162577137,162603648,162684667,162822813,162823577,
LOC111557364	rna50615	NC_018734.3	+	19102268	19105955	19102268	19105600	3	19102268,19102541,19105391,	19102283,19103058,19105955,
MAPKAPK3	rna6829	NC_018724.3	+	19506120	19533746	19525218	19532486	12	19506120,19523138,19525211,19526535,19527069,19528937,19529639,19530234,19530656,19531206,19531565,19532333,	19506229,19523228,19525351,19526600,19527149,19529061,19529769,19530310,19530781,19531292,19531646,19533746,
MAPKAPK3	rna6828	NC_018724.3	+	19505576	19533746	19506004	19532486	11	19505576,19505952,19525211,19526535,19527069,19528937,19530234,19530656,19531206,19531565,19532333,	19505651,19506229,19525351,19526600,19527149,19529061,19530310,19530781,19531292,19531646,19533746,
MAPKAPK3	rna6827	NC_018724.3	+	19505558	19533746	19506004	19532486	11	19505558,19505952,19525211,19526535,19527069,19528937,19530234,19530656,19531206,19531565,19532333,	19505681,19506229,19525351,19526600,19527149,19529061,19530310,19530781,19531292,19531646,19533746,
LOC111560961	gene12665	NC_018728.3	+	2147483646	-2147483648	75045677	75051353	0		
LOC109503493	rna40565	NC_018731.3	+	72170232	72171300	72170232	72171300	2	72170232,72170610,	72170369,72171300,
LOC111558154	rna60143	NC_018737.3	+	7513632	7513730	7513632	7513730	1	7513632,	7513730,
LOC101088754	gene12138	NC_018728.3	+	49291526	49296291	49291526	49296291	1	49291526,	49296291,
LOC102902640	rna26581	NC_018728.3	-	124219427	124225245	124219427	124225245	3	124219427,124221630,124224160,	124220193,124221709,124225245,

The line LOC111560961 gene12665 NC_018728.3 + 2147483646 -2147483648 75045677 75051353 0 has a null value (a zero, followed by two tabs).

There are probably many more of these lines, but more importantly, how did they get there? When I check my original GTF file:

grep LOC111560961 cat9.gtf > LOC111560961.txt

The output is:

NC_018728.3	Gnomon	CDS	75045678	75045729	.	+	0	transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";
NC_018728.3	Gnomon	CDS	75045904	75046316	.	+	2	transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";
NC_018728.3	Gnomon	CDS	75046835	75046875	.	+	0	transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";
NC_018728.3	Gnomon	CDS	75051137	75051353	.	+	1	transcript_id "gene12665"; transcript_name "gene12665"; gene_id "gene12665"; gene_name "LOC111560961";

So the attributes are there, However the locus and genename are not in the reduced.gtf file

grep LOC111560961 cat9.reduced.gtf
grep gene12665 cat9.reduced.gtf

Gives empty files. Thanks for any advice. I really appreciate it.

Pete

Here is my .dict file.

cat9.zip

This is the first 70 lines of my refFlat file

first70.zip

alecw commented

Hi Pete,

The problem is that there are no exons in this transcript. I think you need to remove any such transcript from your GTF and re-create the refFlat file. The signal is -2147483648 in the refFlat file.

Regards, Alec

Hi Alec,
I appreciate your help. I hope you don't mind but I am new to this analysis and am not sure how to determine if there are exons in the transcripts. Are you saying I should go through the refFlat file and parse out the lines with bad signals, and then use that information to remove those lines from the GTF file? I apologize for having to ask. I can likely wrangle any GTF file if I know exactly what I'm looking for.

Pete

alecw commented

Yes, that is what I'm saying. In GTF format, the definition of a transcript is spread across multiple lines, so it's hard to figure out which transcripts are lacking exons. ConvertToRefFlat, by accident, flags such transcripts with -2147483648. You can get the names of problematic transcripts by searching for that value, filter those transcripts out of GTF, and regenerate the refFlat.

I realize another alternative is simply to remove from your refFlat the lines containing -2147483648.

-Alec

Removing lines with -2147483648 worked for me.