mozack/abra2

workflow of data pre-processing of RNAseq

SaHash opened this issue · 6 comments

Thank you for creating this software. This seems to improve the accuracy of somatic variant calling with my RNAseq data.

I have a question about data pre-processing workflow for RNAseq data.
According to GATK best practice, data processing workflow consists of these steps;

  1. STAR 2-pass mapping
  2. MergeBamAlignment (Picard); merge with unmapped BAM
  3. MarkDuplicates (Picard)
  4. SplitNCigarReads (GATK)
  5. BaseRecalibrator & ApplyBQSR (GATK)

Because I want to use non-GATK variant caller such as VarScan2, I'd like to perform indel realignment using ABRA2 in addition to GATK workflow.
Is it necessary to perform SplitNCigarReads and BaseRecalibration after ABRA2? (Can I use ABRA2 between step3 and step4 described above?)

Thanks!

We've run the GATK tools in conjuction with ABRA2 only a handful of times on RNA. They ran to completion in our tests, but I have not attempted to optimize and/or evaluate the impact of each individual GATK step. In these tests, we ran ABRA2 after marking duplicates and before SplitNCigarReads as you have suggested.

As for the impact of SplitNCigarReads and BQSR on VarScan2, I am not really sure. I believe VarScan2 is capable of handling reads with introns (unlike the GATK variant callers). However, I do not know what impact those last 2 steps will have on VarScan2's accuracy. I would suggest running a test both with and without steps 4 and 5. I would be interested in hearing about your results.

Thank you for replying. I will compare the result with and without steps4 and 5. Thanks.

Hi, @mozack
Sorry to bother you again. I've tried to run SplitNCigarReads (GATK v4.1.4.1) after MarkDuplicates and ABRA2, but I got error as below despite I used -fixNDN flag.

java.lang.IllegalArgumentException: Cannot split this read (might be an empty section between Ns, for example 1N1D1N): 32M495N192D1367N4M

I didn't get the error when I ran SplitNCigarReads right after MarkDuplicates (before ABRA2).
Didn't you face the same error when you ran SplitNCigarReads after ABRA2 in your test data?

I did not run into this in my testing (it looks like the version you are using is a bit newer).

There is no -fixNDN flag in ABRA2 - or are you referring to SplitNCigarReads?

If this is causing a problem for GATK, you might consider running with the --no-ndn flag in ABRA2.

Thank you for your response. I meant this error occurred with SplitNCigarReads and -fixNDN is the option of SplitNCigarReads. Sorry for asking a question even though it was not an error in ABRA2.
After running ABRA2 with --no-ndn as you said, I could perform SplitNCigarReads without the error. Thank you so much.

No problem and glad you got past that error.

I will add that I think removing the introns from all of the alignments is in general not a good idea (BQSR actually considers the position within the read which I believe is lost when the reads are split) and we have avoided using GATK for RNASeq data in our own pipelines for these reasons. That said, I have not carefully tested the impact of the GATK post alignment tools on various variant callers