broadinstitute/Drop-seq

Is that possible to skip the SamtoFastq and map with bam file directly

ysu2015 opened this issue · 8 comments

Hi,
There is a SamtoFastq step after all the tagging and triming step before mapping with STAR. I tried to mapped directly with the $_unaligned_mc_tagged_polyA_filtered.bam instead of convert to Fastq first. The mapping step works fine. But when I go the the next step "sortSam", I got error like below.
"Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. RG ID on SAMRecord not found in header:" I tried using "AddOrReplaceReadGroups" to revised the sam file, not work.
Any idea what could be get wrong?and any solutions? Thank you so much.

Y

alecw commented

Hi @ysu2015

AFAIK, STAR needs FASTQ input. I have no idea what STAR will do if you don't give it FASTQ input, and I don't understand why you would omit that step. I'm surprised STAR didn't fail. If you don't follow the instructions we provide, you're on your own.

Regards, Alec

Hi Alec,

Thank you for you quick response. STAR actually take bam file as input. (FYI, I used STAR 2.7.6a), and I just curious and think if we could skip the SamtoFastq step. And it turn out that STAR works fine. Just the next step did not work through.

Y

alecw commented

OK, I didn't realize that. Does it work if you run SamToFastq? If so, then I think the thing to do is to try it both ways, and see what is different between STAR output from BAM input, and STAR output from FASTQ input.

Good point. Will give a try.

Y

Hi Alex,
Here is what I got. FYI, they are not come from the same dataset.

So, if I run samTofastq and mapped with STAR, I got
NB551268:196:HTML7BGXC:1:11101:18609:6569 0 chr2 240137037 255 1S34M4S * 0 0 CATGTGGTCCATACACACAGTGGAACATTACTCAGTCGTAAAAAEEEEEAEEEEE/EEEEEEEEEEEEEEEEEEEEEE NH:i:1 HI:i:1 AS:i:27 nM:i:3

If I mapped directly with the tagged&Trimed bam file, I got
NB551268:214:HJ773BGXF:1:11101:10000:10876 16 chr10 48435445 255 66M * 0 0 TCAGATGTTTTGTTCATGATACTATCCTTCAGGGTTATGTGCTTATCAATGAAATAACCCCAGAGG EEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA NH:i:1 HI:i:1 AS:i:64 nM:i:0 XC:Z:GGGGGGGGGGGGGGGGGGGGGGGG RG:Z:Split XM:Z:GGGGGGGGGG

It looks like there are some tags been take over to the mapping results, which impact the downstream script.
Then when I look into the .merge.bam, which used as input for "TagReadWithGeneFunction", I found that those tags have been put back again, and added some others. (example like below). This make me think maybe I could use the "bam_direct_mapped_bam" as input for "TagReadWithGeneFunction", if those additional tags are not use for downstream analysis. What do you think? Thank you for discussion.
NB551268:196:HTML7BGXC:1:13108:12554:11558 16 chr1 10145 3 15S51M * 0 0 ATGGTGCTAAACAACAATCCCTAACCCTATCCCTATCCCTAACCCTAACCTAACCCTAACCCTAAC EAEEEEEAEEEEEEEEEEEEEEEEEEEA6EEEEEEEEEEEEEEEEEEEEEEE6EEEEEAEEAAAAA XC:Z:TCCGTCTAAGTGGTCAAGCAGAAT MD:Z:2C11A5A30 PG:Z:STAR RG:Z:Split NH:i:2 NM:i:3 XM:Z:TGCTTGCTGT UQ:i:108 AS:i:44

Oooo, still facing the error:
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. RG ID on SAMRecord not found in header
I guess, I need to reheader the bam file anyway.

Y

alecw commented

Hi @ysu2015 ,

It appears this is a flaw in STAR. Presumably the BAM that is input to STAR has an @rg record in the header, but that is removed from the output. It would be good if you reported this to the STAR developers so they can fix that at some point.

In general, we've found aligners to be bad at creating BAMs that pass the rigorous validation that is the default for Picard tools. That's why years ago we created MergeBamAlignment, which extracts only the alignment info from the aligned BAM, and everything else is obtained from the unmapped BAM.

Absent a fix from STAR, there are a few ways to fix this problem:

  • Presumably the RG tags that are causing the problem are produced by whatever tool you are using to create the original unmapped BAM. Not emitting RG tags would solve the problem. This doesn't appear to be possible with Picard IlluminaBasecallsToSam.

  • You can disable this validation by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENTto SortSam. You might also need to pass this to MergeBamAlignment. I dislike giving this option because disabling validation opens the door to all sorts of problems that are hard for us to support.

  • You can convert to FASTQ as we recommend.

Regards, Alec

Hi Alec,
Thank you so much for the clarification. Very helpful.
Bests,
Y