skovaka/UNCALLED

Generating sequencing summary from fast5 raw reads

maximilianmordig opened this issue · 1 comments

Hi @skovaka
Thank you for developing UNCALLED.

I am wondering how to generate the sequencing_summary file that is needed to run the "uncalled sim" command as described in the README: /path/to/control/fast5s --ctl-seqsum /path/to/control/sequencing_summary.txt. These files don't seem to be provided.
I have downloaded some E. coli fast5 raw reads, but unfortunately they don't come with a sequencing_summary.txt. As I understand it, the control fast5 files are only used to supply the raw fast5 signal for the simulation, so I am also wondering why the simulator relies on fields such as template_duration, which are basecaller-specific.

Thank you.

We mainly use the sequencing summary to infer the timing between reads on each channel. This information is present in the fast5s as well, but parsing every fast5 file takes much longer than reading one text file. We also use the template start and duration to trim the adapter sequence and any noisy signal from each end of the reads. The ReadUntil API is able to do this in real time, and the sequencing summary was the best/easiest way I could find to mimic that behavior. So you are correct that it should be possible to simulate without a sequencing summary, but it would take some effort to work around those issues.
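To illustrate the two uses described above, here is a minimal Python sketch, not UNCALLED's actual implementation. It assumes the summary is a tab-separated file with the columns read_id, channel, start_time, duration, template_start and template_duration, with all times in seconds since the start of the run, and a 4000 Hz sampling rate. The function names and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

SAMPLE_RATE_HZ = 4000  # assumption: signal sampling rate of the run

def load_reads(tsv_text):
    """Group summary rows by channel, sorted by read start time."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    by_channel = defaultdict(list)
    for row in reader:
        by_channel[int(row["channel"])].append({
            "read_id": row["read_id"],
            "start": float(row["start_time"]),
            "duration": float(row["duration"]),
            "template_start": float(row["template_start"]),
            "template_duration": float(row["template_duration"]),
        })
    for reads in by_channel.values():
        reads.sort(key=lambda r: r["start"])
    return dict(by_channel)

def channel_gaps(reads):
    """Seconds between the end of each read and the start of the next one."""
    return [nxt["start"] - (cur["start"] + cur["duration"])
            for cur, nxt in zip(reads, reads[1:])]

def template_sample_range(read, sample_rate=SAMPLE_RATE_HZ):
    """Sample indices (start, end) of the template within the raw signal."""
    first = round((read["template_start"] - read["start"]) * sample_rate)
    length = round(read["template_duration"] * sample_rate)
    return first, first + length

# Hypothetical example rows, deliberately out of order within the channel.
COLUMNS = ["read_id", "channel", "start_time", "duration",
           "template_start", "template_duration"]
ROWS = [
    ["read_b", "1", "16.5", "2.0", "16.75", "1.5"],
    ["read_a", "1", "10.0", "5.0", "10.5", "4.25"],
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [COLUMNS] + ROWS) + "\n"

reads = load_reads(EXAMPLE_TSV)
gaps = channel_gaps(reads[1])
trim = template_sample_range(reads[1][0])
print(gaps, trim)
```

Reading one text file this way gives the per-channel gaps and the trim boundaries without opening any fast5 file. Working without a summary would mean recovering the same values from each read's fast5 metadata.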

Some example sequencing summaries from human and a mock microbial community are available here: https://labshare.cshl.edu/shares/schatzlab/www-data/UNCALLED/simulator_files/