wdecoster/nanostat

runtime stats

Closed this issue · 9 comments

It would be great if the summary includes runtime stats as well.

The problem is that the summary text file stores start_time as an offset rather than as the real date/time value provided in the fastq file. Hence, the total runtime computed from the values in the summary file is not correct, because the offset starts at 0 again for each run_id (mux scan, restarts, etc.).

Unfortunately, nanostat does not accept both the summary and the fastq file at the same time.

I'm not sure which runtime stats you would like.

The summary files I have here, generated by basecalling with guppy, are correct.

> I'm not sure which runtime stats you would like.

Start and stop time, total runtime (stop - start), and usage time (sum of the runtimes per run_id).

> The summary files I have here, generated by basecalling with guppy, are correct.

I have summary files produced by albacore v2.3.1. They contain real numbers in the start_time column. In the fastq header the start_time field contains timestamps.

Based on these real numbers, at least the usage time (the summed runtime of all run_ids) can be computed.
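As a rough illustration of that computation, here is a minimal sketch in Python. It assumes a tab-separated summary with columns named run_id, start_time and duration, where start_time is an offset in seconds that restarts for every run_id. The function name `usage_time_seconds` and the example rows are made up for illustration and are not part of NanoStat.

```python
import csv
import io
from collections import defaultdict


def usage_time_seconds(summary_handle):
    """Sum the per-run_id runtime from a tab-separated sequencing summary.

    Assumption: the file has run_id, start_time and duration columns, and
    start_time is an offset in seconds that restarts for each run_id.
    """
    first_start = {}
    last_end = defaultdict(float)
    for row in csv.DictReader(summary_handle, delimiter="\t"):
        run_id = row["run_id"]
        start = float(row["start_time"])
        end = start + float(row["duration"])
        # Track the earliest read start and the latest read end per run_id.
        first_start[run_id] = min(start, first_start.get(run_id, start))
        last_end[run_id] = max(end, last_end[run_id])
    per_run = {rid: last_end[rid] - first_start[rid] for rid in first_start}
    return sum(per_run.values()), per_run


# Made-up example: two run_ids whose offsets both start near 0.
example = io.StringIO(
    "read_id\trun_id\tstart_time\tduration\n"
    "r1\trunA\t0.0\t10.0\n"
    "r2\trunA\t100.0\t20.0\n"
    "r3\trunB\t5.0\t5.0\n"
    "r4\trunB\t50.0\t10.0\n"
)
total, per_run = usage_time_seconds(example)
print(total, per_run)
```

Because the offsets are only meaningful within a run_id, this gives the usage time but says nothing about the real start and stop time or about gaps between runs.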

You are already handling the different data types of the start_time field in NanoPlot. But the problem exists there too, as the following two graphs show:

Produced with NanoPlot --summary sequencing_summary.txt ...:
[Figure: mi000_b0925_number_of_reads_over_time]

Produced with NanoPlot --fastq_minimal ...:
[Figure: mi000_b0925_fqm_number_of_reads_over_time]

I know, these examples are not related to NanoStat specifically but to your base functions in nanopack in general.

Hence I suggest adding at least --fastq_minimal to NanoStat as an additional parameter that is not mutually exclusive with --summary (for NanoPlot as well), and parsing the timestamps from the fastq file.
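To make the suggestion concrete, here is a minimal Python sketch of parsing timestamps from fastq headers and deriving start time, stop time and total runtime. It assumes four-line fastq records whose header carries a field of the form `start_time=2018-09-25T10:00:00Z`. The function name `runtime_from_fastq` and the example reads are hypothetical, and this is not how NanoStat or NanoPlot parse the file.

```python
import re
from datetime import datetime

# Assumed header field layout: start_time=YYYY-MM-DDTHH:MM:SSZ
START_TIME = re.compile(r"start_time=(\S+)")


def runtime_from_fastq(lines):
    """Return (start, stop, total) based on start_time fields in fastq headers.

    Assumption: records span exactly four lines, so every fourth line
    (index 0, 4, 8, ...) is a header. The stop time is the start of the
    last read, because the header does not carry the read duration.
    """
    stamps = []
    for index, line in enumerate(lines):
        if index % 4 != 0:
            continue
        match = START_TIME.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ"))
    start, stop = min(stamps), max(stamps)
    return start, stop, stop - start


# Made-up example reads from two run ids.
example = [
    "@read1 runid=abc read=1 ch=10 start_time=2018-09-25T10:00:00Z", "ACGT", "+", "!!!!",
    "@read2 runid=abc read=2 ch=11 start_time=2018-09-25T10:30:00Z", "ACGT", "+", "@@@@",
    "@read3 runid=def read=1 ch=12 start_time=2018-09-25T12:15:30Z", "ACGT", "+", "!!!!",
]
start, stop, total = runtime_from_fastq(example)
print(start, stop, total)
```

Since the timestamps are absolute, the result does not depend on how many run_ids the reads came from, which is the property the summary offsets lack after a restart.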

Did you run basecalling twice, once per folder? I create plots from hundreds of summary files (PromethION) and the time information is correct: the start_time in the next summary file starts where the previous has stopped.

No, I ran albacore only once. Maybe the reason in this case is that the MinION run was interrupted. After restarting it, a new mux scan and sequencing run were created. I assume that albacore computes the real number in start_time of sequencing_summary.txt based on the start time of each sequencing run. Do you take this into account? pycoQC does.

Ah, yes, if the run was restarted I can imagine the sequencing_summary.txt is not correct. I'm not sure how pycoQC solves this.

pycoQC solves this by grouping by run_id. Admittedly, it is impossible to put the runs in chronological order; they can only be ordered by size or runtime.

I have put 'adding run_time metrics to nanostat report' to my to-do list, but as I'm writing my thesis it gets a fairly low priority for now.

For the other problem, using the fastq_minimal or fastq_rich input is going to be the way forward. I don't intend to make changes to how summaries are parsed in the near future.