nanoporetech/pod5-file-format

More info on some read table values


Hey again,

I wanted to know a little more about the following fields.

num_minknow_events,
tracked_scaling_scale,
tracked_scaling_shift,
predicted_scaling_scale,
predicted_scaling_shift,
num_reads_since_mux_change,
time_since_mux_change,

Specifically:

num_minknow_events: What is this actually? Is it the number of signal chunks in the signal table, or something else? What is an event, and how are events detected? What is this field used for?

tracked_scaling_scale,
tracked_scaling_shift,
predicted_scaling_scale,
predicted_scaling_shift,

These four I kinda understand. But how are they actually calculated? Are they used for anything? Do any ONT tools use them? If not, are there plans to use them, and why are they captured and stored?

Lastly,

num_reads_since_mux_change,
time_since_mux_change,

I understand what these are (pretty self-explanatory), but why are they tracked? Are they used for something?

Converting from pod5->slow5, I can always store this stuff in aux fields if it makes sense to do so, but if the values aren't used for anything I don't see the point in storing them at all, other than for completeness/lossless conversion.

If going from slow5->pod5, and the fields are needed, then I need to know how they are calculated so I can recompute them when a slow5 file doesn't already contain them in its aux fields. Alternatively, if they're not needed, is it okay to provide them with some appropriate null value when writing the pod5 read?

Thanks for any insight you can give me.

Cheers,
James

P.S. When I'm done with my converter, it would be good to have a chat. I have some thoughts about the API and internal data structures, and it would be good to catch up on how things are going with pod5/slow5 anyway.

Hi James,

num_minknow_events is a count of the internal MinKNOW events in the read. It can be used to estimate the bases in the file by multiplying by some conversion ratio.
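
A minimal sketch of that estimate, with a purely hypothetical EVENTS_TO_BASES_RATIO (the real conversion ratio depends on the chemistry/configuration and isn't given here):

# Hypothetical sketch of the base estimate George describes.
# EVENTS_TO_BASES_RATIO is a placeholder, not a real constant.
EVENTS_TO_BASES_RATIO = 1.0

def estimate_bases(num_minknow_events: int) -> int:
    # Multiply the per-read event count by the assumed conversion ratio.
    return int(num_minknow_events * EVENTS_TO_BASES_RATIO)

# Summing estimate_bases over every read gives a rough per-file base estimate.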

The scale and shift values are pairs of numbers that relate to the signal level of the read. They are derived from some internal metrics MinKNOW builds as the experiment progresses. The intention is to make them available for scaling in the basecaller. The tracked values are based on previous reads from the same channel/mux, so they are complex to recalculate. Dorado doesn't use the values at present.

num_reads_since_mux_change and time_since_mux_change are also intended for use in a downstream analysis pipeline where you need to decide which scaling parameters to use.

Alternatively if not needed, then is it okay to provide these with some appropriate null type when writing the pod5 read?

MinKNOW won't ever produce reads with null values here, but our fast5 converter uses "nan" in place of the tracking values, and 0 for num_reads_since_mux_change and time_since_mux_change. These seem like safe null replacements to me.
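
To make those stand-ins concrete, here is a small sketch of the defaults a converter could fall back to when a slow5 file doesn't carry these aux fields (field names follow the pod5 read table; the values are just the replacements described above):

import math

# Fallback values for a pod5 read when the source slow5 lacks these aux
# fields: NaN for the scaling/shift values, 0 for the mux-change fields.
POD5_FIELD_DEFAULTS = {
    "tracked_scaling_scale": math.nan,
    "tracked_scaling_shift": math.nan,
    "predicted_scaling_scale": math.nan,
    "predicted_scaling_shift": math.nan,
    "num_reads_since_mux_change": 0,
    "time_since_mux_change": 0.0,
}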

  • George

Hey George,

As always, very helpful.

For all of these fields, are they calculated by MinKNOW and provided to the pod5 writer as-is, or does some extra calculation happen?

To clarify: I imagine there is some pod5 read object created by MinKNOW that is then handed over to the writer module to handle. I'm trying to get a handle on where these values are calculated: before or after being handed over to the pod5 writer?

Thanks

James

The values are calculated deep in the minknow analysis engine, where we have context over the ongoing state of each channel.

Hope that helps,

  • George

Yep, that helps.

That's all for now. I'll be sure to ping again if I run into issues.
For now, I've got a working s2p and p2s converter for pod5<->slow5.
I'm doing some thorough checks before I release it.
I appreciate the help so far.
Cheers,
James

Hey,

Sorry, just re-opening this for a quick related question.

[fields.num_minknow_events]
type = "int8"
description = "Number of minknow events that the read contains"

The spec shows this field type as int8

So between -128 and 127

Is that big enough to contain the number of minknow events in all cases? (Again, I still don't quite understand what an "event" is in this context.)

Also, are events ever negative?

Cheers,
James

Hmm,

I think the docs are wrong.

https://github.com/search?q=repo%3Ananoporetech%2Fpod5-file-format+num_minknow_events&type=code

Can you please update?

Also, I think this should be returning an int rather than a float, although I'm not sure what the .as_py() is doing to the type here.

@property
def num_minknow_events(self) -> float:
    """
    Find the number of minknow events in the read.
    """
    return self._batch.columns.num_minknow_events[self._row].as_py()  # type: ignore
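
For what it's worth, pyarrow's .as_py() on an integer-typed Arrow scalar returns a plain Python int, so the float annotation above looks like a mislabel. A quick check (assuming the column is stored as an unsigned integer type, as the code search suggests):

import pyarrow as pa

# .as_py() converts an Arrow scalar to the matching Python type; for an
# integer-typed scalar that is a Python int, not a float.
scalar = pa.scalar(1234, type=pa.uint64())
print(type(scalar.as_py()))  # <class 'int'>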

I'll use a uint64 for that field for slow5.

Cheers,
James

I agree - I'll update.

  • George

Cheers,

I have also updated the pyslow5 API to handle those fields above, and my converter is now working pretty well.

Now time to move on and make it go fast.

Thanks for your help.
James

@jorj1988

I have three questions regarding the scaling/shift values. I guess these values are going to be used later in basecallers instead of z-score or quantile scaling, possibly because they are a better alternative for handling reads from genomic regions with an unbalanced percentage of A/C/G/T.

  1. What is the difference between tracked_scaling* and predicted_scaling*?
  2. How is a signal scaled using these shift and scale values: (signal_value+shift)*scale, (signal_value-shift)*scale, (signal_value+shift)/scale, or (signal_value-shift)/scale? (The candidates are written out in the sketch below.)
  3. What determines whether tracked_scaling* or predicted_scaling* is best to use?
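
To make the candidates in question 2 concrete, here is a tiny sketch enumerating them (purely illustrative; none of these is confirmed as the convention MinKNOW or the basecaller actually uses):

import numpy as np

# The four candidate transforms from question 2, written out so they can be
# compared against a basecaller's behaviour. None is confirmed as correct.
def scaling_candidates(signal: np.ndarray, shift: float, scale: float) -> dict:
    return {
        "(x + shift) * scale": (signal + shift) * scale,
        "(x - shift) * scale": (signal - shift) * scale,
        "(x + shift) / scale": (signal + shift) / scale,
        "(x - shift) / scale": (signal - shift) / scale,
    }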