bug (?): unexpected results with enforced interval size greater than default
eboyden opened this issue · 4 comments
When I aligned a pair of fastqs using BWA-MEM2, Bowtie2, or SNAP, and an enforced interval size of 0-500, I got similar results. When I aligned with BWA-MEM2 or Bowtie2 and an enforced interval size of 50-500, the results didn't change much; very few on-target alignments had intervals <50 bp. But when I aligned with SNAP and an enforced (using -fs
) interval size of 50-500, many fewer reads aligned. Furthermore, here is an example of a read pair that no longer aligned with a minimum interval of 50 bp, despite the fact that it's a 120 bp properly paired interval:
MN01688:12:000H3MG5N:1:23106:10622:10676 99 chr1 215878739 70 119M = 215878739 120 AAAAATCATAGTCACCTTCTCTTACCTCAAATTAGGTCCATTTGGCTTGGATGGTGGTTGCCAAGAAATCACAACATATGATTCACTTAGTGGAATCACAGACAATGGGCCAACATTCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF LB:Z:lb MC:Z:120M MD:Z:119 PG:Z:SNAP RG:Z:rg PL:Z:ILLUMINA NM:i:0 SM:Z:sm MQ:i:70 UQ:i:0 QS:i:4389 PU:Z:pu ms:i:4389
MN01688:12:000H3MG5N:1:23106:10622:10676 147 chr1 215878739 70 120M = 215878739 -120 AAAAATCATAGTCACCTTCTCTTACCTCAAATTAGGTCCATTTGGCTTGGATGGTGGTTGCCAAGAAATCACAACATATGATTCACTTAGTGGAATCACAGACAATGGGCCAACATTCTG FFFFFFFFFFFFFFFFFFFF=FFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF LB:Z:lb MC:Z:119M MD:Z:120 PG:Z:SNAP RG:Z:rg PL:Z:ILLUMINA NM:i:0 SM:Z:sm MQ:i:70 UQ:i:0 QS:i:4388 PU:Z:pu ms:i:4388
I also recapitulated this on the NA12878 data NIST7035_TAAGGCGA; the alignment rate fell from 97% to 59%. And it only seems to happen with the -fs
option.
samtools flagstat reports:
-s 0 500
:
39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
3114693 + 0 duplicates
3114693 + 0 primary duplicates
39079698 + 0 mapped (98.36% : N/A)
39079698 + 0 primary mapped (98.36% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
38521349 + 0 properly paired (96.96% : N/A)
38615594 + 0 with itself and mate mapped
464104 + 0 singletons (1.17% : N/A)
19728 + 0 with mate mapped to a different chr
14333 + 0 with mate mapped to a different chr (mapQ>=5)
-s 0 500 -fs
(similar results):
39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2899310 + 0 duplicates
2899310 + 0 primary duplicates
38627431 + 0 mapped (97.22% : N/A)
38627431 + 0 primary mapped (97.22% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
38533349 + 0 properly paired (96.99% : N/A)
38556972 + 0 with itself and mate mapped
70459 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
-s 50 500
(similar results):
39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
3144369 + 0 duplicates
3144369 + 0 primary duplicates
39074551 + 0 mapped (98.35% : N/A)
39074551 + 0 primary mapped (98.35% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
23181753 + 0 properly paired (58.35% : N/A)
38607952 + 0 with itself and mate mapped
466599 + 0 singletons (1.17% : N/A)
333828 + 0 with mate mapped to a different chr
47673 + 0 with mate mapped to a different chr (mapQ>=5)
-s 50 500 -fs
(abnormally low alignment rate):
39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
1658040 + 0 duplicates
1658040 + 0 primary duplicates
23583000 + 0 mapped (59.36% : N/A)
23583000 + 0 primary mapped (59.36% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
23344882 + 0 properly paired (58.76% : N/A)
23512538 + 0 with itself and mate mapped
70462 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
For comparison:
Bowtie2 --very-fast-local -I 0 --no-discordant
(default):
39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2509954 + 0 duplicates
2509954 + 0 primary duplicates
39685464 + 0 mapped (99.89% : N/A)
39685464 + 0 primary mapped (99.89% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
39514079 + 0 properly paired (99.45% : N/A)
39653236 + 0 with itself and mate mapped
32228 + 0 singletons (0.08% : N/A)
13340 + 0 with mate mapped to a different chr
8318 + 0 with mate mapped to a different chr (mapQ>=5)
Bowtie2 --very-fast-local -I 50 --no-discordant
:
39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2510277 + 0 duplicates
2510277 + 0 primary duplicates
39685267 + 0 mapped (99.88% : N/A)
39685267 + 0 primary mapped (99.88% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
39382600 + 0 properly paired (99.12% : N/A)
39652842 + 0 with itself and mate mapped
32425 + 0 singletons (0.08% : N/A)
19408 + 0 with mate mapped to a different chr
8593 + 0 with mate mapped to a different chr (mapQ>=5)
Reviewing this a few months later, I noticed that although the alignment rate with -s 50 500
is high (98.35%), the "properly paired" rate is low (only 58.35%), which suggests that nearly half of the alignments have short intervals. So adding -fs
to the command is simply preventing those "improperly paired" alignments from occurring. So the problem seems to be with the minimum insert size, not the -fs
option.
I wonder if SNAP computes the interval differently than the other aligners. I'll take a look at it and see what's up.
Yeah I think you're right - I just took another look at alignments with either -s 0 500 -fs
or -s 50 500 -fs
, and in the latter case not only did the number of alignments with intervals >=50 sharply decrease, I also still observed plenty of alignments with intervals <50. Note that I'm basing interval size on column 9 of the sam file (TLEN).
I'll figure it out and try to harmonize what we're doing with the other aligners.