Incorrect read length distribution produced when read length close to reference length

Question

Incorrect read length distribution produced when read length close to reference length

amandawarr opened this issue 3 years ago · 1 comments

Describe the bug
We are trying to simulate "very good" sequencing data for an amplicon that is almost the length of the ~15kb genome we are sequencing, but the read length distributions coming out of badread are not as expected. With a mean of 14kb and a low std (10) we get a histogram like this:

And with a high std (13000) we get a distribution like this:

The histograms printed to stdout during the run appear as expected, but the data out of the end doesn't reflect them.

To reproduce
The genome used is here: https://www.ncbi.nlm.nih.gov/nuccore/1695217306
The command used is:
badread simulate --reference Lelystad.fasta --quantity 800x --length 14000,10 --error_model random --qscore_model ideal --glitches 0,0,0 --junk_reads 0 --random_reads 0 --chimeras 0 --identity 95,100,4 --start_adapter_seq "" --end_adapter_seq ""

Answer 1 · 2023-04-20T06:31:01.000Z

Short version: Badread's read lengths may not respect the given distribution very well on short linear reference sequences.

Long version:
This is a tough one! Badread randomly chooses a place on the sequence to start each read, and when the reference is linear (as I think yours is), that means reads which start near the end of the sequence can be truncated. For long reference sequences, this only affects a small proportion of reads, but for short references, it can affect a lot of reads, messing up the read length distribution.

I could potentially fix that problem by restricting where Badread can start a read, ensuring that there is enough room for the read to finish, but this has another effect: it makes the read depth fall off toward the ends of the sequence, which is unrealistic for linear replicons.

So I think I have a trade-off between respecting the read length distribution vs ensuring even depth. And since read length distributions will definitely be wrong when they are too long for the replicon (e.g. mean 14 kbp and stdev 13 kbp for a 15 kbp sequence), I'll choose to prioritise even depth. I.e. this behaviour is an inherent limitation of how Badread works 😦

If you a tighter read length distribution than Badread is providing, I think it will be necessary to filter based on read length after it finishes. Sorry!

Ryan