mourisl/Lighter

Error correction fails with small genome sizes

Closed this issue · 6 comments

bede commented

Lighter doesn't appear to like small genomes. Here I'm using ~4M reads and a 10 kb genome size, with which Lighter does nothing. If I increase the genome size to a megabase, the report looks more like what I'd expect. Is this a bug or a feature? Tested on Ubuntu and Yosemite.

10kb genome size:

bede-rmbp:lighter $ ./lighter -k 11 10000 0.1 -r H140260768-1B.R1.fastq -r H140260768-1B.R2.fastq -t 2
[2014-12-04 12:09:31] =============Start====================
[2014-12-04 12:09:32] Bad quality threshold is "#"
[2014-12-04 12:09:37] Finish sampling kmers
[2014-12-04 12:09:37] Bloom filter A's error rate: 0.998423
[2014-12-04 12:09:50] Finish storing trusted kmers
[2014-12-04 12:10:06] Finish error correction
Processed 3782706 reads:
    0 are error-free
    Corrected 0 bases(0.000000 corrections for reads with errors)
    Trimmed 0 reads with average trimmed bases 0.000000
    Discard 0 reads

1Mb genome size:

bede-rmbp:lighter $ ./lighter -k 11 1000000 0.1 -r H140260768-1B.R1.fastq -r H140260768-1B.R2.fastq -t 2
[2014-12-04 12:34:32] =============Start====================
[2014-12-04 12:34:33] Bad quality threshold is "#"
[2014-12-04 12:34:38] Finish sampling kmers
[2014-12-04 12:34:38] Bloom filter A's error rate: 0.001192
[2014-12-04 12:35:07] Finish storing trusted kmers
[2014-12-04 12:35:26] Finish error correction
Processed 3782706 reads:
    3543702 are error-free
    Corrected 233998 bases(0.979055 corrections for reads with errors)
    Trimmed 0 reads with average trimmed bases 0.000000
    Discard 0 reads

Li: could this be related to how alpha is being set? The README says "A rule of thumb is that alpha=(7/C), where C is the coverage of the data set." But that's a little unclear. Does "coverage" mean average coverage? The total number of nucleotides in the input? Some more help in the documentation, and perhaps a tool that, given the input reads, automatically picks a good value for alpha, would be very helpful to users.

Yes, this seems to be related to the value of alpha. From the number of reads and the genome size, the average coverage of the data set would be at least 400x even if the read length were 1. If the read length is 100 bp, then the average coverage is about 40,000x and alpha should be 7/40000 = 0.000175. In that case, the -k option should be something like "-k 11 10000 0.000175".
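For concreteness, here is a minimal Python sketch of that rule of thumb (alpha ~ 7/C, where C = number of reads x read length / genome size); the 100 bp read length and the suggest_alpha helper are illustrative assumptions, not part of Lighter itself:

# Back-of-the-envelope alpha (sampling rate) from the README's rule of thumb:
#   alpha ~= 7 / C, where C is the average coverage of the data set.
# The 100 bp read length below is an assumption for this thread's data.

def suggest_alpha(num_reads, read_length, genome_size):
    """Coverage C = num_reads * read_length / genome_size; alpha = 7 / C."""
    coverage = num_reads * read_length / genome_size
    return coverage, 7.0 / coverage

coverage, alpha = suggest_alpha(num_reads=3782706, read_length=100, genome_size=10000)
print(f"coverage ~ {coverage:.0f}x, alpha ~ {alpha:.6f}")
# Rounding the coverage to ~40,000x, as above, gives alpha = 7/40000 = 0.000175,
# i.e. a command along the lines of:
#   lighter -k 11 10000 0.000175 -r H140260768-1B.R1.fastq -r H140260768-1B.R2.fastq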

Please let us know whether changing alpha (the sampling rate) works on your data set.

And I'm working on introducing a feature where the user only needs to provide the k-mer size to Lighter.

bede commented

I'm sorry, Li and Ben; I had overlooked this rule of thumb for what is evidently a critical and sensitive parameter. Until you've implemented automatic alpha selection, perhaps you could mention setting alpha in the executable's built-in help.

Ridiculous speed!

[2014-12-04 15:33:07] =============Start====================
[2014-12-04 15:33:07] Bad quality threshold is "#"
[2014-12-04 15:33:09] Finish sampling kmers
[2014-12-04 15:33:09] Bloom filter A's error rate: 0.000000
[2014-12-04 15:33:32] Finish storing trusted kmers
[2014-12-04 15:33:58] Finish error correction
Processed 3782706 reads:
    1005358 are error-free
    Corrected 2088177 bases(0.751860 corrections for reads with errors)
    Trimmed 0 reads with average trimmed bases 0.000000
    Discard 0 reads

real    0m51.012s
user    1m57.975s
sys 0m41.569s

Many thanks,
Bede

bede commented

Trivial PR: #9

Thanks! We've merged your PR.

bede commented

:)