mbhall88/rasusa

Relationship to filtlong

tseemann opened this issue · 2 comments

FYI - a comment from a colleague:

You can still use filtlong with the settings below to focus on quality only and more or less ignore length in the scoring metric:

--min_length 500
--mean_q_weight 10
--length_weight 1
--target_bases $((DEPTH * GENOMESIZE))
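For anyone reading along, here is a minimal sketch of how those options might be combined into one command. The values of `DEPTH` and `GENOMESIZE` and the file names `reads.fastq.gz` and `subsampled.fastq.gz` are assumptions for illustration only; they are not from the comment above. The guard simply skips the filtlong call when the tool or the input file is not present.

```shell
#!/usr/bin/env bash
# Hypothetical values, for illustration only
DEPTH=50             # desired depth of coverage
GENOMESIZE=5000000   # estimated genome size in bases (5 Mbp)

# Number of bases to keep = depth x genome size
TARGET_BASES=$((DEPTH * GENOMESIZE))
echo "target bases: $TARGET_BASES"

# reads.fastq.gz and subsampled.fastq.gz are placeholder file names
if command -v filtlong >/dev/null 2>&1 && [ -f reads.fastq.gz ]; then
    filtlong --min_length 500 \
             --mean_q_weight 10 \
             --length_weight 1 \
             --target_bases "$TARGET_BASES" \
             reads.fastq.gz | gzip > subsampled.fastq.gz
fi
```

With these example values, 50x coverage of a 5 Mbp genome works out to 250,000,000 target bases.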

Yes, this is exactly how I have been using filtlong previously (exact same weights and all). There is still some filtering of read length happening here, which is a (subtle) bias. I am very keen to keep this project out of the filtering business as there are already great tools for this.
Removing the --min_length option here is obviously much less biased, but there is still a scoring system at work, so the result is not strictly random. In my experience with these weightings, it does not focus purely on quality; there is definitely still some length-favouring that happens. I guess my aim with rasusa was to provide as few parameters as possible, i.e. users don't need to play with scoring weights etc. Maybe I am being silly and everyone will keep using filtlong, which is also fine.

I have a section in the motivation where I mention how filtlong can be co-opted to do something similar. Do you think I need to provide better clarification around filtlong?

I don't know if this is also of interest, but in my local benchmarking rasusa was significantly faster than filtlong. But I don't feel comfortable focusing on this as I am not trying to compete with filtlong.

No worries - all good - thanks for the explanation.