Counts in ONT data
Hello dear Salmon developers,
First of all, thank you very much for your effort in supporting Oxford Nanopore reads! I've been using Salmon for quantification of ONT sequencing experiments, and recently I decided to dive deeper into how it produces counts for ONT data. The release notes for the initial ONT support (v1.5.1) state that counts should be 100 for all transcripts, because at that time it was not clear how EffectiveLength should be computed. However, Salmon now (as of release v1.10.1) produces meaningful count estimates. I tried to figure out the algorithm by looking at the code, but failed...
Is there a place where you have this algorithm documented, or, if not, could you please explain how it is implemented?
Thank you in advance!
Hello! I did some work on the Oxford Nanopore error model last summer. There's a blog post about ONT long-read quantification here: https://combine-lab.github.io/salmon-tutorials/2021/ont-long-read-quantification/ .

In terms of length correction, the --ont flag basically turns it off, since length correction doesn't really apply to long reads.

The error model that the current version of Salmon uses for the --ont flag (found in src/ONTAlignmentModel.cpp) bins reads by length (into 4 bins by default, I believe). For each bin it learns a binomial/geometric distribution over the number of errors (mismatches or indels) in the alignments of the reads in that bin, as well as distributions over the number of bases soft-clipped at the beginning and end of the read. It then uses these models to penalize reads whose error/soft-clip counts fall far from the center of the learned distribution, but only when the number of errors/soft-clips is larger than expected for that bin (a smaller-than-expected number of errors in an alignment is generally a good sign, not a bad one, for how likely the read is to map to that transcript).

I'm not the original author of this model, so I don't have all the details on the specifics of how it works or the design decisions that went into it, but let me know if you have any other questions I can answer!
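To make the shape of that idea concrete, here is a minimal C++ sketch of "bin by length, learn a per-bin error rate, penalize only above expectation". Everything here is my own illustration, not the actual code: the names (`LengthBinModel`, `observe`, `logPenalty`), the uniform bin boundaries, the running-mean fit, and the geometric-style tail penalty are all assumptions and will differ from what src/ONTAlignmentModel.cpp actually does.

```cpp
#include <array>
#include <algorithm>
#include <cmath>
#include <cstddef>

// Per-bin statistics: a running estimate of the per-base error rate.
struct BinStats {
    double meanErrorRate = 0.0;
    std::size_t numObservations = 0;
};

class LengthBinModel {
    static constexpr std::size_t kNumBins = 4;       // default bin count per the comment above
    static constexpr std::size_t kMaxReadLen = 100000; // assumed cap; real binning may differ
    std::array<BinStats, kNumBins> bins_;

    std::size_t binFor(std::size_t readLen) const {
        // Uniform-width bins over [0, kMaxReadLen); an assumption for this sketch.
        std::size_t b = (readLen * kNumBins) / kMaxReadLen;
        return std::min(b, kNumBins - 1);
    }

public:
    // Online update of the bin's error rate from one observed alignment.
    void observe(std::size_t readLen, std::size_t numErrors) {
        if (readLen == 0) return;
        auto& st = bins_[binFor(readLen)];
        double rate = static_cast<double>(numErrors) / static_cast<double>(readLen);
        ++st.numObservations;
        st.meanErrorRate += (rate - st.meanErrorRate) / st.numObservations;
    }

    // Log-space penalty: zero when the read has no more errors than expected
    // for its bin, increasingly negative beyond that. This mirrors the
    // "only penalize above the expected count" behavior described above;
    // the real model fits a binomial/geometric distribution rather than
    // this simple tail approximation.
    double logPenalty(std::size_t readLen, std::size_t numErrors) const {
        if (readLen == 0) return 0.0;
        const auto& st = bins_[binFor(readLen)];
        double expected = st.meanErrorRate * static_cast<double>(readLen);
        if (static_cast<double>(numErrors) <= expected) return 0.0; // fewer errors: no penalty
        double p = std::max(st.meanErrorRate, 1e-6);
        return (static_cast<double>(numErrors) - expected) * std::log(p);
    }
};
```

A fuller version of this would also learn the per-bin soft-clip distributions for the start and end of the read and fold all of the penalties into the alignment's log-likelihood, as the description above suggests.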