Kingsford-Group/kourami

Duplicate reads in BAM

Opened this issue · 2 comments

Hi,

This tool is very useful in my workflow but I have a question about the preprocessing step.

The chr6 coordinates overlap causing the final alignment to have duplicate primary alignments for the same read. The number of primary alignments matches the number of times the filter step used "samtools view chr6:START1-END1...chr6:STARTn-ENDn" .

For example, if I use this process to filter reads it duplicats reads in the first samtools view and then duplicats them a second time for the second samtools extraction. These four reads then have primary alignments to different contigs in the panel (each of which has a flag of 83 or 163)
https://github.com/Kingsford-Group/kourami/blob/master/scripts/alignAndExtract_hs38DH.sh

A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*266:01	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*266:01	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*379:01	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*379:01	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*351:01	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*351:01	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	83	DPB1*20:01:02	5085	0	76M	=	5042	-119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76
A00117:192:HFGFGDMXX:1:1339:16486:16830	163	DPB1*20:01:02	5042	0	76M	=	5085	119		NM:i:0	MD:Z:76	AS:i:76	XS:i:76

The number of alignments is arbitrarily large depending on how many times the read was included in the FASTQ. They can also align multiple times to the same contig. Should I actually be creating a BAM with all possible alignments or with one primary alignment only?

BAM with one primary alignment should work fine as one primary alignment will be projected on to our graph representation of alleles. My guess is that the original bam file that you have duplicated entries for this read.

Thanks @heewookl, the overlapping regions will pull the same primary read once for each of the regions in the overlap (leading to multiple primary alignments in the output). I have a workflow that corrects this at https://bitbucket.nygenome.org/projects/COMPBIO/repos/hla_prep/browse