subsampler
Why does RAMPART divide by 10,000 (100 twice) when the probability of picking a read is greater than 1?
In the case below the probability of a read being picked was 108.6%, but the subsampler log reports the probability as 0.010018 instead of 1.0018. I logically expected an output file containing all of the original reads plus a subsample of 1.0018% of "resampled" reads.
Estimated that library: Sample; has approximately 404854224 bases in both files. Estimated genome size is: 2200000; so actual coverage (per file) is approximately: 92.01232363636363; we will only keep 108.68109406214322% of the reads in each file to achieve approximately 200X coverage in both files
cat RAMPART20160912_171111-group-spades-raw-subsample-Sample_200x-file2.log
Subsampler
Seed 519289336
Readed 2029342
Printed 20330
Probability 0.010018
DONE
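For reference, the figures in the log above are consistent with keepFraction = targetCoverage / (2 × perFileCoverage). A quick sketch reproducing the arithmetic (variable names are mine, not RAMPART's):

```java
public class CoverageMath {
    public static void main(String[] args) {
        long totalBases = 404_854_224L;           // across both files, from the log above
        long genomeSize = 2_200_000L;
        double perFileCoverage = (totalBases / 2.0) / genomeSize;  // ~92.01X, matches the log
        double keepFraction = 200.0 / (2.0 * perFileCoverage);     // ~1.0868, i.e. 108.68%
        // Dividing the percentage by 100 twice (i.e. by 10,000) gives ~0.0109,
        // the same order of magnitude as the 0.010018 the subsampler reports.
        System.out.printf("coverage=%.2fX keep=%.4f%n", perFileCoverage, keepFraction);
    }
}
```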
Thanks,
Alessandro
I suspect there is a bug when the specified coverage implies that we should oversample rather than subsample due to a lack of input data. Oversampling normally won't help improve your assembly. I should add a check to the code that stops processing if the user requests more coverage than is present in the input data.
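A minimal sketch of such a guard at the point where the keep-probability is computed (all names here are illustrative, not RAMPART's actual code):

```java
// Hypothetical guard: reject requests for more coverage than the input provides.
final class SubsampleGuard {
    static double keepProbability(double targetCoverage, double actualCoverage) {
        double p = targetCoverage / actualCoverage;
        if (p >= 1.0) {
            // More coverage requested than the input contains: oversampling
            // cannot help the assembly, so fail fast rather than mis-scale p.
            throw new IllegalArgumentException(String.format(
                "Requested %.0fX but input only provides %.2fX; cannot subsample.",
                targetCoverage, actualCoverage));
        }
        return p;  // a genuine probability in (0, 1)
    }
}
```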
"stop processing" means stop that particular "oversampled" coverage or means stops the entire pipeline?
I would agree with the first option.
Good point. Yes, the best behaviour would be to emit a warning message, disable the over/subsampling, and continue with the complete dataset.
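Under the same assumptions as the sketch above (illustrative names, not RAMPART's actual code), that behaviour could look like this: warn, return a probability of 1.0 so every read is kept, and let the pipeline continue.

```java
import java.util.logging.Logger;

// Hypothetical revision of the guard: warn and keep everything instead of aborting.
final class SubsampleGuard {
    private static final Logger LOG = Logger.getLogger(SubsampleGuard.class.getName());

    static double keepProbability(double targetCoverage, double actualCoverage) {
        double p = targetCoverage / actualCoverage;
        if (p >= 1.0) {
            LOG.warning(String.format(
                "Requested %.0fX but input only provides %.2fX; "
                    + "disabling subsampling and continuing with the complete dataset.",
                targetCoverage, actualCoverage));
            return 1.0;  // keep every read; downstream stages see the full dataset
        }
        return p;
    }
}
```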