Question about unassigned reads when running in alignment mode
This is not a bug but more of a question. I've run Salmon in alignment mode with a transcriptome BAM file generated by STAR. The BAM file contains no unaligned reads. My question is about the small number of reads that are often not assigned to any rich equivalence class: I am trying to understand what these reads are. I notice that this only happens when the input is paired-end reads. I suspect the unassigned reads may be dovetailed paired-end reads, but I don't know. The --allowDovetail option is not available in alignment mode. Here is an excerpt of the log:
Completed first pass through the alignment file.
Total # of mapped reads : 6205189
# of uniquely mapped reads : 1718004
# ambiguously mapped reads : 4487185
[2024-08-14 18:21:52.491] [jointLog] [info] Computed 350358 rich equivalence classes for further processing
[2024-08-14 18:21:52.491] [jointLog] [info] Counted 6192944 total reads in the equivalence classes
As you can see, 6192944 out of 6205189 reads were assigned to rich equivalence classes, which leaves 12245 reads unassigned.
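To put a number on the gap, here is a minimal Python sketch that only does arithmetic on the two counts from the log excerpt. The variable names are my own, and nothing here calls Salmon or reads the BAM file.

```python
# Counts copied by hand from the Salmon log excerpt above.
total_mapped = 6_205_189   # "Total # of mapped reads"
in_eq_classes = 6_192_944  # "Counted ... total reads in the equivalence classes"

# Reads that were mapped but never landed in a rich equivalence class.
unassigned = total_mapped - in_eq_classes
unassigned_pct = 100 * unassigned / total_mapped

print(f"unassigned reads: {unassigned}")
print(f"share of mapped reads: {unassigned_pct:.3f}%")
```

That comes to 12245 reads, roughly 0.2% of the mapped total, so the effect is small but consistent.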
It would be nice to know what the excluded reads are, and whether there are options to rescue them, similar to --allowDovetail.
This is Salmon version 1.10.3, but I also ran an older version, which generated the same results.
I realized that most of these unassigned reads are probably paired-end reads that didn't match the specified libType, which was "IU" (inward, unstranded). So I ran samtools stats on my BAM file to verify that.
SN inward oriented pairs: 6191674
SN outward oriented pairs: 13515
The inward pair count, 6191674, is close to the number Salmon assigned, 6192944, but not the same. That's OK, considering Salmon and samtools probably define inward and outward read pairs differently.
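For anyone who wants to check the comparison, here is a small Python sketch that lines up the Salmon and samtools counts quoted above. It is arithmetic only, with variable names of my own choosing. Reading the leftover difference as "pairs samtools calls outward but Salmon still assigned" is my guess, not confirmed behaviour of either tool.

```python
# Counts copied by hand from the Salmon log and the samtools stats output above.
salmon_total_mapped = 6_205_189
salmon_assigned = 6_192_944
samtools_inward = 6_191_674
samtools_outward = 13_515

salmon_unassigned = salmon_total_mapped - salmon_assigned

# Do the two samtools orientation classes account for everything Salmon saw?
pairs_sum = samtools_inward + samtools_outward
print("inward + outward:", pairs_sum, "| Salmon total mapped:", salmon_total_mapped)

# How far apart are the two tools on each side of the split?
extra_assigned = salmon_assigned - samtools_inward      # assigned beyond "inward"
outward_not_dropped = samtools_outward - salmon_unassigned  # "outward" beyond unassigned
print("assigned minus inward:", extra_assigned)
print("outward minus unassigned:", outward_not_dropped)
```

With these numbers, inward plus outward equals Salmon's total of mapped reads exactly, and both differences come out to 1270. So the two tools appear to agree on the total and disagree on the orientation of only about 1270 pairs.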
I think it would be helpful if Salmon reported in the log how many reads were excluded and for what reason. Thanks.