broadinstitute/picard

Problem with Picard tool for Collecting RNASeqMetrics

annepureddy opened this issue · 4 comments

Hello,
I am using the STAR aligner to align bulk RNA-seq samples generated from Smart-Seq-2 protocol using the Nextera XT Dna library prep kit.
I have no errors and am able to successfully generate all the output files.
Then I try to use the picard Collect RNA-seq metrics tool using:

  picard CollectRnaSeqMetrics \
    I= ${sample}.Aligned.sortedByCoord.out.bam \
    O=${sample}.output.RNA_Metrics \
    REF_FLAT="/Users/f006f9w/Dropbox (Dartmouth College)/Goods Lab/Projects/MFGM_NIH_Rimi/Bulk RNA-Seq/Bam_files/refFlat.txt" \
    STRAND=SECOND_READ_TRANSCRIPTION_STRAND \
    RIBOSOMAL_INTERVALS=GRCh38.p5.rRNA.interval_list

However, my .RNA_metrics file does not have any coding, UTR or intronic bases. It would be great if you could please let me know what the issue is here.

## METRICS CLASS	picard.analysis.RnaSeqMetrics
PF_BASES	PF_ALIGNED_BASES	RIBOSOMAL_BASES	CODING_BASES	UTR_BASES	INTRONIC_BASES	INTERGENIC_BASES	IGNORED_READS	CORRECT_STRAND_READS	INCORRECT_STRAND_READS	NUM_R1_TRANSCRIPT_STRAND_READS	NUM_R2_TRANSCRIPT_STRAND_READS	NUM_UNEXPLAINED_READS	PCT_R1_TRANSCRIPT_STRAND_READS	PCT_R2_TRANSCRIPT_STRAND_READS	PCT_RIBOSOMAL_BASES	PCT_CODING_BASES	PCT_UTR_BASES	PCT_INTRONIC_BASES	PCT_INTERGENIC_BASES	PCT_MRNA_BASES	PCT_USABLE_BASES	PCT_CORRECT_STRAND_READS	MEDIAN_CV_COVERAGE	MEDIAN_5PRIME_BIAS	MEDIAN_3PRIME_BIAS	MEDIAN_5PRIME_TO_3PRIME_BIAS	SAMPLE	LIBRARY	READ_GROUP
923592354	902491934	1438	0	0	0	902490496	0	0	0	0	0	0	0	0	0.000002	0	0	0	0.999998	0	0	0	0	0	0	0			

Thank you for the help!

There may be an issue with your refflat file. Could you share a few lines of the refflat file you are using?

This is the head of the refFlat.txt file that I am using

DDX11L1 rna0    chr1    +       11873   14409   14409   14409   3       11873,12612,13220,      12227,12721,14409,
WASH7P  rna1    chr1    -       14361   29370   29370   29370   11      14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320,       14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370,
MIR6859-1       rna3    chr1    -       17368   17391   17391   17391   1       17368,  17391,
MIR6859-1       rna2    chr1    -       17368   17436   17436   17436   1       17368,  17436,
MIR6859-1       rna4    chr1    -       17408   17431   17431   17431   1       17408,  17431,
MIR1302-2       rna5    chr1    +       30365   30503   30503   30503   1       30365,  30503,
MIR1302-2       rna6    chr1    +       30437   30458   30458   30458   1       30437,  30458,
FAM138A rna7    chr1    -       34610   36081   36081   36081   3       34610,35276,35720,      35174,35481,36081,
OR4F5   NM_001005484.1  chr1    +       69090   70008   69090   70008   1       69090,  70008,
LOC101927589    rna9    chr1    -       120711  133748  133748  133748  3       120711,129054,133373,   120932,129223,133748,
kockan commented

The refflat itself looks fine, but I suspect it might not be read/processed properly by the program. It seems to run and produce some output, so the input bam and rRNA interval list don't look too suspicious at this point. There is a whitespace after I= which gives a very slight uneasiness to me but again, since we have some metrics out of the alignment file, it should be okay.

Given that PF_ALIGNED_BASES, RIBOSOMAL_BASES, and INTERGENIC_BASES are reported (as non-zero values) and the default label of a base not overlapping with rRNA or a gene is intergenic, the most likely culprit is still the refflat file. It looks like the path to the refflat contains some whitespaces, so might be worth checking this out. Could you copy the file locally to something like "test_refflat.txt" and try again? If this also fails, looking at the program log and the full header from the metric output file might also help.

Most likely issue is that the refflat is somehow malformed (whitespace, etc.) or there is an issue reading it due to path, etc. Closing for now, reopen if user has updates.