skovaka/UNCALLED

[QUESTION] 10X run-time for reads of 500 raw signals

Closed this issue · 12 comments

Hello,
I'm running uncalled_map with 60 threads on a 100 kb reference, and I'm observing that the run-time is approximately 10X higher when the average signal length is 500 instead of 1000. Adapters are trimmed; these are r9.4.1 DNA reads.
Do you think there is a reason for this?

That is a little surprising, but I think it's a product of the branching alignment algorithm. You would hope that the runtime would only double if the signal length doubles, but it's not necessarily linear because of the branching. I mostly benchmarked signal chunks in 4000 sample segments, so I wonder if the potentially exponential branching levels off at some point between 1000 and 4000 samples? That would definitely be reference specific, but your reference isn't very long.
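
To illustrate what I mean, here's a toy cost model of a branching search; the constants are made up and this is not UNCALLED's actual code, but it shows how per-sample cost can rise with signal length and then level off once pruning saturates:

```cpp
// Toy cost model of a branching search (a sketch, not UNCALLED's algorithm).
// Work per event is proportional to the number of live candidate paths; the
// path count can grow until pruning caps it, so total time is not linear in
// signal length. All constants here are made up for illustration.
#include <algorithm>
#include <cstdio>

int main() {
    const double growth = 1.01;    // hypothetical per-event branch growth
    const long   cap    = 10000;   // hypothetical prune limit on live paths
    for (long len : {500L, 1000L, 2000L, 4000L}) {
        long paths = 1, work = 0;
        for (long e = 0; e < len; ++e) {
            work += paths;                      // one scoring step per live path
            paths = std::min((long)(paths * growth) + 1, cap);
        }
        std::printf("len=%5ld total_work=%9ld work/sample=%8.1f\n",
                    len, work, (double)work / len);
    }
    return 0;
}
```

In this toy, total work grows faster than linearly for short signals and flattens toward the prune cap for long ones, which is the kind of leveling-off I'd expect somewhere between 1000 and 4000 samples.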

@skovaka
I agree with your point, but I think your answer mostly addresses my observation below.
One other thing I observed is that for average read lengths (in raw samples) in range(1000, 10000, 500), the throughput in Msamples/sec increases monotonically.

@skovaka
I'm wondering if UNCALLED has a one-time fixed cost (possibly memory-related) per read (<~0.1 s), say path buffer allocation or something like that? I have excluded the FAST5 loading time while observing this.
This would help explain my last observation.

Sorry for the delay, I understand your question better now. Longer reads do take longer to align, since every event in a read is iterated over with the default "uncalled map" command. Additionally, each event iteration is not constant time, since the branching search algorithm considers many paths per event, and repetitive sequences can make the branching worse, so the read length/time relationship is not even consistent between reads. You could set the "max chunks" option to limit the number of chunks attempted per read, which would lessen the effect of read length; this is how uncalled works in realtime mode, but the time per read still wouldn't necessarily be constant. Hope that clears things up!
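
Conceptually, the "max chunks" behavior looks something like the sketch below; the option is real, but the Mapper interface here is hypothetical and the loop is illustrative rather than the actual implementation:

```cpp
// Sketch of chunk-limited mapping. Mapper and add_chunk are hypothetical
// stand-ins, not UNCALLED's real API.
#include <cstddef>
#include <vector>

struct Mapper {
    // Hypothetical: consume one chunk of signal, return true once a
    // confident alignment has been found.
    bool add_chunk(const std::vector<float>& chunk) { return false; }
};

bool map_read(Mapper& m,
              const std::vector<std::vector<float>>& chunks,
              std::size_t max_chunks) {
    for (std::size_t i = 0; i < chunks.size() && i < max_chunks; ++i) {
        if (m.add_chunk(chunks[i]))
            return true;   // mapped: stop consuming signal
    }
    return false;          // unmapped, but the work per read was capped anyway
}

int main() {
    Mapper m;
    std::vector<std::vector<float>> chunks(10, std::vector<float>(4000));
    return map_read(m, chunks, 5) ? 0 : 1;  // at most 5 chunks attempted
}
```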

Thank you for your answer, @skovaka. I really appreciate you taking the time to clarify my doubt.
I understand your point that the read length/time relationship is not consistent between reads. However, I'm still not sure I found the answer I was looking for.
I don't think I made the following point clear: I am comparing the number of reads processed per unit time for the same set of reads, not for different reads. Additionally, all my reads are non-targets (human), and I am trying to map them to a ~100 kb microbial target reference.
Let me explain my experiment a bit.

  1. I have N reads which are all at least 10K raw samples long (subsetted using ont_fast5_api).
  2. Thereafter, I used fast5_research to trim the reads and create sub-datasets from the same N reads. Each sub-dataset has the same fixed number of raw samples per read: 500, 1000, 1500, 2000, 2500, ..., 10000. I have 20 sub-datasets now.
  3. I ran UNCALLED on these 20 sub-datasets, each containing the same N reads (but at different lengths), and measured the average time per read. I observed that the average time per read stays in the range 100.1 to 101.16 ms on an Intel Cascade Lake core. So increasing the sample length of the same set of reads (by 20X) does not seem to increase the mapping time significantly.
  4. I see your point that mapping time per event can differ due to branching, but the trend in (3) seems to point to some kind of fixed one-time cost that UNCALLED pays for mapping the first 500 samples or so (see the small cost model after this list).
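
To make (3) and (4) concrete, here is a back-of-the-envelope model; t0, k, and the 500-sample cutoff are assumptions I'm making for illustration, not measured UNCALLED internals:

```cpp
// Hypothetical timing model: time per read = fixed cost t0 plus a per-sample
// cost k that stops accruing once the read has been decided. All three
// constants below are assumptions chosen for illustration.
#include <algorithm>
#include <cstdio>

int main() {
    const double t0      = 0.100;  // hypothetical fixed cost per read, seconds
    const double k       = 2e-6;   // hypothetical cost per sample actually aligned
    const double aligned = 500;    // hypothetical samples consumed before mapping stops
    for (int len : {500, 1000, 2500, 5000, 10000}) {
        double t = t0 + k * std::min<double>(len, aligned);
        std::printf("len=%5d  time/read=%.4f s  throughput=%.4f Msamples/s\n",
                    len, t, len / t / 1e6);
    }
    return 0;
}
```

With a dominant fixed cost, time per read stays near 100 ms regardless of length, while throughput in Msamples/sec grows almost linearly with read length, which matches both of my observations.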

I'm just trying to find your expert opinion on what that cost would be/if (3) makes sense to you in some way.

I see, so the question is why it does NOT take longer to align longer reads? The main per-read constant cost would be loading from fast5s; everything else should be allocated once. I bet it's because most reads align within the first few thousand samples, and UNCALLED will not continue aligning after a confident alignment is identified, so additional signal won't make much difference. This is not how an aligner like minimap works, since we're not aiming for full-length alignments. Sorry for the confusion; I realize I was wrong yesterday when I said it iterates over the entire read. It's been a while since I worked on the algorithm.
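
The behavior I'm describing looks roughly like this; the names and the confidence threshold are hypothetical, and this is only a sketch of the idea, not the actual code path:

```cpp
// Sketch of confidence-based early stopping (assumed behavior per the
// comment above; Alignment and the threshold are hypothetical names).
#include <cstddef>
#include <vector>

struct Alignment { double confidence = 0.0; int ref_pos = -1; };

Alignment align_until_confident(const std::vector<float>& events,
                                double threshold) {
    Alignment best;
    for (std::size_t i = 0; i < events.size(); ++i) {
        // ... update candidate paths with events[i] and rescore `best` ...
        if (best.confidence >= threshold)
            return best;   // confident hit: the remaining events are skipped
    }
    return best;           // read never mapped; all events were processed
}
```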

@skovaka
Yes, you got my question. Thank you so much.
I may just be seeing FAST5 loading time.

Is UNCALLED done loading FAST5s at this step?
I see the file read happens here:

file.read(sig_path, int_data);

If not, then I am worried there may be something else we are missing. I'm trying to map all non-target reads. UNCALLED does this very well, mapping only 2.5% of the non-target reads (in the 10K-samples-per-read sub-dataset) and even fewer of them with fewer samples per read. And I don't think UNCALLED's default configuration has an early-termination heuristic that saves time on non-target reads that don't map (for fewer than 10K samples), right?

I may be wrong, but looking at the code, I think you are loading fast5s only at this step. So the thread pool is basically loading and mapping, not just mapping, and I was counting the file loading time as part of the mapping time.

Thank you.

Are you profiling the C++ binary with a single thread? Python would have additional overhead, and I wouldn't be surprised if my thread pool is not the most efficient. The thread pool loads fast5s in the main thread, and mapping happens within the worker threads, so it's quite possible the time is dominated by main-thread fast5 loading. You could actually try running the "uncalled_map_ord" binary, which loads all fast5s up front and then aligns reads in the order they were sequenced. That's not the default mode since it requires more memory, but it could be helpful for you.

@skovaka
Yes, I am trying uncalled_map (C++ binary) with single thread.
Am I correct to note that FAST5 loading happens here for every thread?

fast5s_.fill_buffer();

Let me try uncalled_map_ord; it seems to suit my use case. Thank you.

Fast5 loading does happen there for "uncalled_map", but map_ord uses a different thread pool. To be clear, that function is executed on the main thread, so fast5 loading is always single-threaded; it loads reads into buffers which are then processed by the worker threads here:

void MapPool::MapperThread::run() {
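
For anyone reading later, the division of labor is the classic single-producer/multi-consumer pattern; the sketch below is generic C++ to show the shape, not UNCALLED's actual MapPool:

```cpp
// Generic producer/consumer sketch: the main thread is the sole loader
// (analogous to fast5s_.fill_buffer()), and worker threads drain the
// shared buffer (analogous to MapperThread::run()).
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Read { std::vector<float> signal; };

std::queue<Read> buffer;
std::mutex mtx;
std::condition_variable cv;
bool done = false;

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [] { return !buffer.empty() || done; });
        if (buffer.empty()) return;   // producer finished, queue drained
        Read r = std::move(buffer.front());
        buffer.pop();
        lk.unlock();
        // ... map r.signal here ...
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(worker);
    // Main thread is the only one reading files into the buffer:
    for (int i = 0; i < 100; ++i) {
        { std::lock_guard<std::mutex> lk(mtx); buffer.push(Read{}); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(mtx); done = true; }
    cv.notify_all();
    for (auto& t : workers) t.join();
    return 0;
}
```

Since a single producer feeds every worker, loading throughput caps the whole pipeline, which is consistent with the profiling picture above.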

Thank you, Sam Kovaka.
UNCALLED is an excellent tool; thank you for maintaining the repo.
I really enjoyed looking into the code and the paper.