Alternate or duplicate alignment in output
jgaupp-invitae opened this issue · 1 comments
Hi,
I've got a question about duplicate reads seen in Abra2 (v. 2.23) output during regression testing. For context, I've recently replaced Abra with Abra2 in a long standing RNA/DNA variant calling workflow. The move to Abra2 was prompted by a switch to STAR for RNA alignment. All of the DNA workflow, with exception of Abra, remained the same. A handful of DNA based regression tests "failed" with slight improvements in variant allele frequencies. However, one test failed with a lost variant. After a little investigation, I found that a post-alignment counting step was balking at duplicate reads created by Abra2.
In the example below, a read pair from the input bam is ouptut by Abra2 as a pair plus an alternate, sense strand read with an upstream aligned YA tag added (YA:Z:chr8:117878503:354M16I431M).
Input Bam
---------
ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1 147 chr8 117878856 30 150M = 117878784 -222 ACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTTATCCCAATGGGCCGCTAGCCAAATTTTGGCCAGAGGCCCTCTTTTACTGAGAACAAAATGTGCGTAGAACATTGTTCTGGCTGGCTATGAAAACAGAAGAAAACCTAAG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGHHCGGFFFFGHGGGGGECGHHHGGGHHHHHGGHHHHHGGGGGHGHHHHHHFHHHHHHHHHHHHHHHHHHH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0T149 YS:i:-53 YT:Z:CP AS:i:295 RG:Z:Sample_11_R1 MQ:i:30 MC:Z:73M16I40M
ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1 99 chr8 117878784 30 73M16I40M = 117878856 222 ACAATCAACAAAGTGAATAAAAATGTCAACATCAAACATACCTTTGGTGAGATGATACTCTCCACGCTGCTCTCTAAATTACACTCGAACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTT GHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHHHGHGGGGGHHHHHGGHHHHHGGGHHHGCEGGGGGHGFFFFGGCHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG XN:i:0 XM:i:0 XO:i:1 XG:i:16 NM:i:16 MD:Z:113 YS:i:-5 YT:Z:CP AS:i:205 RG:Z:Sample_11_R1 MQ:i:30 MC:Z:150M
Output Bam
----------
ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1 147 chr8 117878856 30 150M = 117878784 -222 ACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTTATCCCAATGGGCCGCTAGCCAAATTTTGGCCAGAGGCCCTCTTTTACTGAGAACAAAATGTGCGTAGAACATTGTTCTGGCTGGCTATGAAAACAGAAGAAAACCTAAG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGHHCGGFFFFGHGGGGGECGHHHGGGHHHHHGGHHHHHGGGGGHGHHHHHHFHHHHHHHHHHHHHHHHHHH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0T149 YS:i:-53 YT:Z:CP AS:i:295 RG:Z:Sample_11_R1 MQ:i:30 MC:Z:73M16I40M
ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1 99 chr8 117878784 30 73M16I40M = 117878857 222 ACAATCAACAAAGTGAATAAAAATGTCAACATCAAACATACCTTTGGTGAGATGATACTCTCCACGCTGCTCTCTAAATTACACTCGAACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTT GHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHHHGHGGGGGHHHHHGGHHHHHGGGHHHGCEGGGGGHGFFFFGGCHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG YA:Z:chr8:117878503:354M16I431M MC:Z:1I149M MD:Z:113 RG:Z:Sample_11_R1 XG:i:16 NM:i:16 XM:i:0XN:i:0 XO:i:1 MQ:i:30 AS:i:205 YS:i:-5 YT:Z:CP
ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1 99 chr8 117878784 30 73M16I40M = 117878856 222 ACAATCAACAAAGTGAATAAAAATGTCAACATCAAACATACCTTTGGTGAGATGATACTCTCCACGCTGCTCTCTAAATTACACTCGAACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTT GHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHHHGHGGGGGHHHHHGGHHHHHGGGHHHGCEGGGGGHGFFFFGGCHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG XN:i:0 XM:i:0 XO:i:1 XG:i:16 NM:i:16 MD:Z:113 YS:i:-5 YT:Z:CP AS:i:205 RG:Z:Sample_11_R1 MQ:i:30 MC:Z:150M
Is this behavior expected? I noticed Issue 29, posted by a former colleague of mine, but the data in that case seems to have had alternate reads/alignments in the input.
The DNA command line used for Abra2 is identical to the Abra command line, after accounting for the change from --working to --tmpdir and a larger java heap.
java -Xmx16G -jar abra2.jar --in Sample_11_R1.molbar.trimmed.deduped.masked.snv.nuc_counts.variants.bam --out Sample_11_R1.molbar.trimmed.deduped.masked.snv.nuc_counts.variants.abra2.bam --ref hg19.fa --targets tmp0BgLKG.abraroi.bed --threads 1 --tmpdir tmpCUTlEb_IndelRealigner
I've attached the Abra2 debug log which records specific processing at the alternate, upstream alignment position of chr8:117878503.
abra2_debug.log
.
.
.
At them moment, I'm planning to rollback the DNA workflow to use Abra but plan to continue using Abra2 with duplicate filtering for RNA. I assume the same read duplication is possible for RNA? And the filtering on the YA tag is a reasonable solution?
Additional analysis has tied the problem to custom operations downstream of Abra2. I've not identified the specific incompatibility between our code and Abra2 but I'll post a follow up if the conclusion seems useful.
Abra2 remained the concern for longer than it should have because of workflow file handling that obscured the chain of events. Apologies if you invested time in this.
Thanks for the software and your time.