Feature: Min/max coverage threshold

Question

Feature: Min/max coverage threshold

bbalog87 opened this issue 3 years ago · 2 comments

Hi, thank you for the great tool!

Sometimes, we need to subsample long reads (e.g HiFi) to retain only reads that are at least x times present in the reads set.
Let's say I need to keep only reads with at least 40x coverage. This is particularly crucial to automatically exclude all erroneous reads/k-mers with potential contaminations. Because errors and contaminations do not usually have deep coverage in the genome. That is why erroneous k-mers are always at the beginning of the k-mers histograms with a low depth, usually < 20. Having an option to subsample only reads with a min coverage would maybe exclude most erroneous reads and contaminations, hence improving the assembly for instance.

Answer 1 · 2022-01-23T22:07:10.000Z

Correct me if I'm wrong, but to do this you would need an alignment (i.e., BAM/SAM file)? Because to know if two reads constitute 2x coverage, you need to know they come from the same part of the genome correct?

If that is the case, I don't currently support alignment subsampling (and likely wont), but #17 (comment) links to a tool that can randomly subsample an alignment.

If I have the wrong idea could you please clarify?

Answer 2 · 2022-01-24T00:49:02.000Z

Hi @mbhall88
You are completely right. That was also my thought. Most of the tools out there including seqtk, seqkit etc. only subsample (randomly) a subset of reads to reduce the overall coverage. This is what is needed most of the time. Alignment-based subsampling is only a special usecase.

Thanks for the clarification and for the hint on issue 17.