Should --ubam really include secondary alignments?
fellen31 opened this issue · 3 comments
The --ubam flag will provide metrics for all reads in the file, regardless of whether they are aligned or not
When running with --ubam
on a file with aligned reads, secondary alignments will get a length of 0 and be included in the median read length calculations, arrow output etc.
test-data/small-test-phased.bam
3fda06e9-62ef-4448-9993-b90124a793d5 0 chr7 152743763 60 52S53M1I40M2I9M3I654M1D304M2I53>
19d9337f-4fb6-46e5-b484-14d05f562506 16 chr7 152743766 60 99M1D599M6D86M1D93M2D4M1D5M1I79>
35febf09-dcbc-424c-987e-9f3f80fe73a5 16 chr7 152748627 60 7S304M2I78M1D306M1D440M3I350M1I>
34884519-a6b4-4cf9-9c07-0822ab6d199d 16 chr7 152755375 60 393M1I105M1I7M2I766M2D145M1I860>
13dbd4b1-d897-42ac-9a9c-a575ddbb70d5 272 chr7 152762381 0 1503S363M1I3M1D566M13D97M1I509M>
3597be79-ad8a-462b-aea9-234c35cad0c8 16 chr7 152763666 60 14190S651M2D325M2D359M2D410M1I2>
d49d98b0-50ce-46a6-a316-e122906e4582 16 chr7 152766418 60 301M1I3M1D371M1D319M1D2M1D1205M>
896f0cb5-0863-4180-818c-c1d2ef49beec 256 chr7 152768160 0 37S175M1I69M1I2M2I28M1I9M1I6M2I
out.arrow
lengths
<int>
1 46025
2 46029
3 33090
4 39496
5 0
6 45333
7 37036
8 0
If not also running with --min-read-len 1
.
lengths
<int>
1 46025
2 46029
3 33090
4 39496
5 45333
6 37036
7 18865
8 28644
However, running with --ubam
will no longer removed soft-clipped bases from the primary alignments.
Compare the output to running without --ubam
.
identities lengths
<dbl> <int>
1 99.2 45973
2 99.1 45992
3 99.3 33043
4 99.1 39464
5 99.0 16780
6 99.2 36999
7 99.3 18802
8 99.1 15550
If secondary alignments are filtered out by default, should they also not be filtered out by default when running with --ubam
?
And should soft-clipped bases then not be removed from primary alignments when running with --ubam
?
I think the problem is that if you are running --ubam
on an actual ubam, there is no such thing as secondary or softclipped. The intention for --ubam
is to represent all the reads, in the way they come from the sequencer. Having reads with a length of 0 is definitely not desirable, though. Any thoughts on how this could be improved?
I have added a suggestion in #30.
One thing is to think about if you have an aligned BAM with some unaligned reads. For example, add 100 unaligned reads to test-data/small-test-phased.bam
, and #30 will report 7416 alignments without --ubam
and 7516 alignments with --ubam
, while the current version reports 7416 without and 8205 alignments with (since it includes unaligned reads and secondary alignments).
Another thing is if you would rather want to represent all alignments in an aligned BAM as unaligned (like running samtools reset), but I think that is a different flag than --ubam
.
samtools reset small-test-phased.bam | grep -v '^@^' | wc -l
6180
--ubam
will and already functions properly on a BAM that only contains unaligned reads you say, since there you never have any secondary reads or soft clipped bases.
Yeh I think you are right, there are indeed two scenarios to keep in mind.
- aligned BAM with some unaligned reads (a common thing)
- actual unaligned bam
I agree a different flag makes most sense then...