broadinstitute/picard

CollectWgsMetrics error, out of bound of the index?

Closed this issue · 8 comments

Hello,

Thanks for a great tool. I am new using picard. After aligning the reads with bwa mem and converting the sam to bam file, I am trying to perform some QC in the aligned reads. I ran the "CollectAlignmentSummaryMetrics" command without problems, but after trying
"CollectWgsMetrics" I encounter an error.

I have used the following code:

java -jar Documents/Apps/picard2.27/picard.jar CollectWgsMetrics \
--INPUT "/home/dna_server/GWAS_cacao_data/01.RawData/6.Reads_aln/CAU37_2_CKDL240009448-1A_HJJ3JCCX2_L6_val_Tc_Criollo_coordinate_sorted.bam" \
--REFERENCE_SEQUENCE "/home/dna_server/GWAS_cacao_data/01.RawData/5.Ref_genome/Criollo/Theobroma_cacao_criollo_chr.v2.0.fna" \
--OUTPUT "/home/dna_server/GWAS_cacao_data/01.RawData/7.variant_calling/picard_metrics/CAU72_2.CollectWgsMetrics.txt"

It starts collecting the information and before finishing I obtained the following error:

[Sat Jun 08 22:34:48 ECT 2024] picard.analysis.CollectWgsMetrics done. Elapsed time: 3.53 minutes.
Runtime.totalMemory()=2275409920
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 10571689 out of bounds for length 10571689
        at picard.analysis.AbstractWgsMetricsCollector.isReferenceBaseN(AbstractWgsMetricsCollector.java:218)
        at picard.analysis.WgsMetricsProcessorImpl.processFile(WgsMetricsProcessorImpl.java:92)
        at picard.analysis.CollectWgsMetrics.doWork(CollectWgsMetrics.java:236)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

I need to mention that the reference genome above is the same one I used to align the reads in previous steps and also in the "CollectAlignmentSummaryMetrics" command.

I apologize if I am making a basic mistake. I hope you can help me.

JB

Thank for this. This might be a bug in the way CollectWgsMetrics iterates over the bam file.

In the log file, do you see the last position that was logged?

Perhaps one of your reads was aligned right up-to or even beyond the edge of the reference. Could you tell me if one of the ~430 contigs of this genome if any of them has length 10571689 (should be in the .dict file)?

Also, if so, would you be able to extract some reads from that region (using samtools view) and post them here? interested in seeing if the same reads on their own also cause this failure.

Thanks for your reply Yossi. I'll try to answer you as best as possible. I am no expert in using the program or programming.

Yes, the last chromosome (chrUn_random) has a length of 10571689, you can check the .dict file attached.

I created the .dict file for my genome (it has 11 chromosomes) and when I reran the same code the error was:

Exception in thread "main" htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 10 don't match: 10/10575761/chrUn_random 10/10571689/chrUn_random/M5=298e24e552c2d6c5acaa4cb8f4116087/UR=file:/home/dna_server/GWAS_cacao_data/01.RawData/5.Ref_genome/Criollo/Theobroma_cacao_criollo_chr.v2.0.fna
        at htsjdk.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:280)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:342)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:328)
        at picard.analysis.CollectWgsMetrics.doWork(CollectWgsMetrics.java:214)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

If it is of interest, I tried the same but using picard 3.0 and the error was more detailed.

Exception in thread "main" picard.PicardException: Sequence dictionary for (/home/dna_server/GWAS_cacao_data/01.RawData/6.Reads_aln/CAU37_2_CKDL240009448-1A_HJJ3JCCX2_L6_val_Tc_Criollo_coordinate_sorted.bam) does not match sequence dictionary
for (/home/dna_server/GWAS_cacao_data/01.RawData/5.Ref_genome/Criollo/Theobroma_cacao_criollo_chr.v2.0.fna)                                                                                                                                                at picard.util.SequenceDictionaryUtils.assertSequenceDictionariesEqual(SequenceDictionaryUtils.java:214)                                 
at picard.analysis.CollectWgsMetrics.doWork(CollectWgsMetrics.java:216)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:280)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)
Caused by: htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 10 don't match: 10/10575761/chrUn_random 10
/10571689/chrUn_random/M5=298e24e552c2d6c5acaa4cb8f4116087/UR=file:/home/dna_server/GWAS_cacao_data/01.Raw
Data/5.Ref_genome/Criollo/Theobroma_cacao_criollo_chr.v2.0.fna                                                                                   at htsjdk.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:280)                                                     at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:342)                                              at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:328)                                              at picard.util.SequenceDictionaryUtils.assertSequenceDictionariesEqual(SequenceDictionaryUtils.java:211)
        ... 4 more  

Sorry, but I don't know where is the log file located, or how to create the log file. If you guide me, I'll create it and send/check it as well.

Here are the extracted reads from the bam file for chrUn_random and the .dict file.

Theobroma_cacao_criollo_chr.v2.0.dict.zip

Thanks.

OK.

Don't hand-craft the dict file. that's not what I meant.

I meant that you subset your bam file

samtools view CAU37_2_CKDL240009448-1A_HJJ3JCCX2_L6_val_Tc_Criollo_coordinate_sorted.bam chrUn_random:10571089:10571688 -o subset.bam

then try running CollectWGS metrics on the resulting subset.bam and also upload it here.

If upload the file to here, I'll take a look.

I just realized that you already uploaded the sam file, but this was extracted without the header, so I cannot use it with picard. Please subset using the command from my previous comment.

I just created the new subset with the -o command. Here it is
subset.zip

I ran the metrics on the subset and got the same error:

[Mon Jun 10 14:11:26 ECT 2024] picard.analysis.CollectWgsMetrics done. Elapsed time: 1.12 minutes.
Runtime.totalMemory()=4719116288
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10571689
        at picard.analysis.AbstractWgsMetricsCollector.isReferenceBaseN(AbstractWgsMetricsCollector.java:218)
        at picard.analysis.WgsMetricsProcessorImpl.processFile(WgsMetricsProcessorImpl.java:92)
        at picard.analysis.CollectWgsMetrics.doWork(CollectWgsMetrics.java:236)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

which reference are you using? the latest reference I see has a different sized chrUn_random sequence (I see LN:10571689, but your bam shows LN:10575761 which is a few bases longer...)

if you don't care so much and in particular, would like to ignore this contig (chrUn_random) you can create a region of interest interval list (create a bed file of the contigs chr1-chr10 and then use picard's BedToIntervalList to convert that to an intervalList when provided a sequence dictionary), and provide that to CollectWgsMetrics.....

I am using the latest reference, I just have a single one in my computer. This is strange perhaps I made a mistake in a previous step?? Your solution would work OK, as this contig maybe be ignored. I will repeat all previous steps, to see if this repeats. Thanks for your help!

Seems to be resolved, reopen if necessary