mozack/abra2

Alternate or duplicate alignment in output

jgaupp-invitae opened this issue · 1 comments

Hi,

I've got a question about duplicate reads seen in Abra2 (v. 2.23) output during regression testing. For context, I've recently replaced Abra with Abra2 in a long standing RNA/DNA variant calling workflow. The move to Abra2 was prompted by a switch to STAR for RNA alignment. All of the DNA workflow, with exception of Abra, remained the same. A handful of DNA based regression tests "failed" with slight improvements in variant allele frequencies. However, one test failed with a lost variant. After a little investigation, I found that a post-alignment counting step was balking at duplicate reads created by Abra2.

In the example below, a read pair from the input bam is ouptut by Abra2 as a pair plus an alternate, sense strand read with an upstream aligned YA tag added (YA:Z:chr8:117878503:354M16I431M).

Input Bam
---------
ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1	147	chr8	117878856	30	150M	=	117878784	-222	ACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTTATCCCAATGGGCCGCTAGCCAAATTTTGGCCAGAGGCCCTCTTTTACTGAGAACAAAATGTGCGTAGAACATTGTTCTGGCTGGCTATGAAAACAGAAGAAAACCTAAG	GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGHHCGGFFFFGHGGGGGECGHHHGGGHHHHHGGHHHHHGGGGGHGHHHHHHFHHHHHHHHHHHHHHHHHHH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:0T149	YS:i:-53	YT:Z:CP	AS:i:295	RG:Z:Sample_11_R1	MQ:i:30	MC:Z:73M16I40M

ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1	99	chr8	117878784	30	73M16I40M	=	117878856	222	ACAATCAACAAAGTGAATAAAAATGTCAACATCAAACATACCTTTGGTGAGATGATACTCTCCACGCTGCTCTCTAAATTACACTCGAACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTT	GHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHHHGHGGGGGHHHHHGGHHHHHGGGHHHGCEGGGGGHGFFFFGGCHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG	                                        XN:i:0	XM:i:0	XO:i:1	XG:i:16	NM:i:16	MD:Z:113	YS:i:-5	YT:Z:CP	AS:i:205	RG:Z:Sample_11_R1	MQ:i:30	MC:Z:150M

Output Bam
----------
ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1	147	chr8	117878856	30	150M	=	117878784	-222	ACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTTATCCCAATGGGCCGCTAGCCAAATTTTGGCCAGAGGCCCTCTTTTACTGAGAACAAAATGTGCGTAGAACATTGTTCTGGCTGGCTATGAAAACAGAAGAAAACCTAAG	GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGHHCGGFFFFGHGGGGGECGHHHGGGHHHHHGGHHHHHGGGGGHGHHHHHHFHHHHHHHHHHHHHHHHHHH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:0T149	YS:i:-53	YT:Z:CP	AS:i:295	RG:Z:Sample_11_R1	MQ:i:30	MC:Z:73M16I40M

ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1	99	chr8	117878784	30	73M16I40M	=	117878857	222	ACAATCAACAAAGTGAATAAAAATGTCAACATCAAACATACCTTTGGTGAGATGATACTCTCCACGCTGCTCTCTAAATTACACTCGAACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTT	GHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHHHGHGGGGGHHHHHGGHHHHHGGGHHHGCEGGGGGHGFFFFGGCHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG	YA:Z:chr8:117878503:354M16I431M	MC:Z:1I149M	MD:Z:113	RG:Z:Sample_11_R1	XG:i:16	NM:i:16	XM:i:0XN:i:0	XO:i:1	MQ:i:30	AS:i:205	YS:i:-5	YT:Z:CP

ATTTCCGTACAATCAACACTTAGGTTTT_molbar_1	99	chr8	117878784	30	73M16I40M	=	117878856	222	ACAATCAACAAAGTGAATAAAAATGTCAACATCAAACATACCTTTGGTGAGATGATACTCTCCACGCTGCTCTCTAAATTACACTCGAACTAAATTACACTCGAACACATGGGCTTTGGTTAGCTTCTT	GHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHHHGHGGGGGHHHHHGGHHHHHGGGHHHGCEGGGGGHGFFFFGGCHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG	                                        XN:i:0	XM:i:0	XO:i:1	XG:i:16	NM:i:16	MD:Z:113	YS:i:-5	YT:Z:CP	AS:i:205	RG:Z:Sample_11_R1	MQ:i:30	MC:Z:150M

Is this behavior expected? I noticed Issue 29, posted by a former colleague of mine, but the data in that case seems to have had alternate reads/alignments in the input.

The DNA command line used for Abra2 is identical to the Abra command line, after accounting for the change from --working to --tmpdir and a larger java heap.

java -Xmx16G -jar abra2.jar --in Sample_11_R1.molbar.trimmed.deduped.masked.snv.nuc_counts.variants.bam --out Sample_11_R1.molbar.trimmed.deduped.masked.snv.nuc_counts.variants.abra2.bam --ref hg19.fa --targets tmp0BgLKG.abraroi.bed --threads 1 --tmpdir tmpCUTlEb_IndelRealigner

I've attached the Abra2 debug log which records specific processing at the alternate, upstream alignment position of chr8:117878503.

abra2_debug.log
.
.
.
At them moment, I'm planning to rollback the DNA workflow to use Abra but plan to continue using Abra2 with duplicate filtering for RNA. I assume the same read duplication is possible for RNA? And the filtering on the YA tag is a reasonable solution?

Additional analysis has tied the problem to custom operations downstream of Abra2. I've not identified the specific incompatibility between our code and Abra2 but I'll post a follow up if the conclusion seems useful.

Abra2 remained the concern for longer than it should have because of workflow file handling that obscured the chain of events. Apologies if you invested time in this.

Thanks for the software and your time.