subsampler
Why does RAMPART divide by 10,000 (100 twice) when the probability of picking a read is greater than 1?
In the case below the probability of a read being picked was 108.6%, but the subsampler log reports the probability as 0.010018 instead of 1.0018. I logically expected an output file containing all of the original reads plus a subsample of 1.0018% of "resampled" reads.
Estimated that library: Sample; has approximately 404854224 bases in both files. Estimated genome size is: 2200000; so actual coverage (per file) is approximately: 92.01232363636363; we will only keep 108.68109406214322% of the reads in each file to achieve approximately 200X coverage in both files
cat RAMPART20160912_171111-group-spades-raw-subsample-Sample_200x-file2.log
Subsampler
Seed 519289336
Readed 2029342
Printed 20330
Probability 0.010018
DONE
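For reference, the figures in the log above are consistent with keepFraction = targetCoverage / (2 × perFileCoverage). A quick sketch reproducing the arithmetic (variable names are mine, not RAMPART's):

```java
public class CoverageMath {
    public static void main(String[] args) {
        long totalBases = 404_854_224L;           // across both files, from the log above
        long genomeSize = 2_200_000L;
        double perFileCoverage = (totalBases / 2.0) / genomeSize;  // ~92.01X, matches the log
        double keepFraction = 200.0 / (2.0 * perFileCoverage);     // ~1.0868, i.e. 108.68%
        // Dividing the percentage by 100 twice (i.e. by 10,000) gives ~0.0109,
        // the same order of magnitude as the 0.010018 the subsampler reports.
        System.out.printf("coverage=%.2fX keep=%.4f%n", perFileCoverage, keepFraction);
    }
}
```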
Thanks,
Alessandro
I suspect there is a bug when the specified coverage implies that we should oversample rather than subsample due to a lack of input data. Oversampling normally won't help improve your assembly. I should add a check to the code that stops processing if the user requests more coverage than is present in the input data.
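A minimal sketch of such a guard at the point where the keep-probability is computed (all names here are illustrative, not RAMPART's actual code):

```java
// Hypothetical guard: reject requests for more coverage than the input provides.
final class SubsampleGuard {
    static double keepProbability(double targetCoverage, double actualCoverage) {
        double p = targetCoverage / actualCoverage;
        if (p >= 1.0) {
            // More coverage requested than the input contains: oversampling
            // cannot help the assembly, so fail fast rather than mis-scale p.
            throw new IllegalArgumentException(String.format(
                "Requested %.0fX but input only provides %.2fX; cannot subsample.",
                targetCoverage, actualCoverage));
        }
        return p;  // a genuine probability in (0, 1)
    }
}
```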
"stop processing" means stop that particular "oversampled" coverage or means stops the entire pipeline?
I would agree with the first option.
Good point. Yes, the best behaviour would be to emit a warning message, disable the over/subsampling, and continue with the complete dataset.
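Under the same assumptions as the sketch above (illustrative names, not RAMPART's actual code), that behaviour could look like this: warn, return a probability of 1.0 so every read is kept, and let the pipeline continue.

```java
import java.util.logging.Logger;

// Hypothetical revision of the guard: warn and keep everything instead of aborting.
final class SubsampleGuard {
    private static final Logger LOG = Logger.getLogger(SubsampleGuard.class.getName());

    static double keepProbability(double targetCoverage, double actualCoverage) {
        double p = targetCoverage / actualCoverage;
        if (p >= 1.0) {
            LOG.warning(String.format(
                "Requested %.0fX but input only provides %.2fX; "
                    + "disabling subsampling and continuing with the complete dataset.",
                targetCoverage, actualCoverage));
            return 1.0;  // keep every read; downstream stages see the full dataset
        }
        return p;
    }
}
```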