Feature: Min/max coverage threshold
bbalog87 opened this issue · 2 comments
Hi, thank you for the great tool!
Sometimes, we need to subsample long reads (e.g HiFi) to retain only reads that are at least x times present in the reads set.
Let's say I need to keep only reads with at least 40x coverage. This is particularly crucial to automatically exclude all erroneous reads/k-mers with potential contaminations. Because errors and contaminations do not usually have deep coverage in the genome. That is why erroneous k-mers are always at the beginning of the k-mers histograms with a low depth, usually < 20. Having an option to subsample only reads with a min coverage would maybe exclude most erroneous reads and contaminations, hence improving the assembly for instance.
Correct me if I'm wrong, but to do this you would need an alignment (i.e., BAM/SAM file)? Because to know if two reads constitute 2x coverage, you need to know they come from the same part of the genome correct?
If that is the case, I don't currently support alignment subsampling (and likely wont), but #17 (comment) links to a tool that can randomly subsample an alignment.
If I have the wrong idea could you please clarify?
Hi @mbhall88
You are completely right. That was also my thought. Most of the tools out there including seqtk, seqkit etc. only subsample (randomly) a subset of reads to reduce the overall coverage. This is what is needed most of the time. Alignment-based subsampling is only a special usecase.
Thanks for the clarification and for the hint on issue 17.