wdecoster/cramino

Should --ubam really include secondary alignments?

fellen31 opened this issue · 3 comments

The --ubam flag will provide metrics for all reads in the file, regardless of whether they are aligned or not

When running with --ubam on a file with aligned reads, secondary alignments will get a length of 0 and be included in the median read length calculations, arrow output etc.

test-data/small-test-phased.bam

3fda06e9-62ef-4448-9993-b90124a793d5    0       chr7    152743763       60      52S53M1I40M2I9M3I654M1D304M2I53>
19d9337f-4fb6-46e5-b484-14d05f562506    16      chr7    152743766       60      99M1D599M6D86M1D93M2D4M1D5M1I79>
35febf09-dcbc-424c-987e-9f3f80fe73a5    16      chr7    152748627       60      7S304M2I78M1D306M1D440M3I350M1I>
34884519-a6b4-4cf9-9c07-0822ab6d199d    16      chr7    152755375       60      393M1I105M1I7M2I766M2D145M1I860>
13dbd4b1-d897-42ac-9a9c-a575ddbb70d5    272     chr7    152762381       0       1503S363M1I3M1D566M13D97M1I509M>
3597be79-ad8a-462b-aea9-234c35cad0c8    16      chr7    152763666       60      14190S651M2D325M2D359M2D410M1I2>
d49d98b0-50ce-46a6-a316-e122906e4582    16      chr7    152766418       60      301M1I3M1D371M1D319M1D2M1D1205M>
896f0cb5-0863-4180-818c-c1d2ef49beec    256     chr7    152768160       0       37S175M1I69M1I2M2I28M1I9M1I6M2I

out.arrow

 lengths
     <int>
 1   46025
 2   46029
 3   33090
 4   39496
 5       0
 6   45333
 7   37036
 8       0

If not also running with --min-read-len 1.

 lengths
     <int>
 1   46025
 2   46029
 3   33090
 4   39496
 5   45333
 6   37036
 7   18865
 8   28644

However, running with --ubam will no longer removed soft-clipped bases from the primary alignments.
Compare the output to running without --ubam.

 identities lengths
        <dbl>   <int>
 1       99.2   45973
 2       99.1   45992
 3       99.3   33043
 4       99.1   39464
 5       99.0   16780
 6       99.2   36999
 7       99.3   18802
 8       99.1   15550

If secondary alignments are filtered out by default, should they also not be filtered out by default when running with --ubam?
And should soft-clipped bases then not be removed from primary alignments when running with --ubam?

I think the problem is that if you are running --ubam on an actual ubam, there is no such thing as secondary or softclipped. The intention for --ubam is to represent all the reads, in the way they come from the sequencer. Having reads with a length of 0 is definitely not desirable, though. Any thoughts on how this could be improved?

I have added a suggestion in #30.

One thing is to think about if you have an aligned BAM with some unaligned reads. For example, add 100 unaligned reads to test-data/small-test-phased.bam, and #30 will report 7416 alignments without --ubam and 7516 alignments with --ubam, while the current version reports 7416 without and 8205 alignments with (since it includes unaligned reads and secondary alignments).

Another thing is if you would rather want to represent all alignments in an aligned BAM as unaligned (like running samtools reset), but I think that is a different flag than --ubam.

samtools reset small-test-phased.bam | grep -v '^@^' | wc -l
6180

--ubam will and already functions properly on a BAM that only contains unaligned reads you say, since there you never have any secondary reads or soft clipped bases.

Yeh I think you are right, there are indeed two scenarios to keep in mind.

  • aligned BAM with some unaligned reads (a common thing)
  • actual unaligned bam

I agree a different flag makes most sense then...