flatironinstitute/mountainsort5

parallelize detection and classification

jbmelander opened this issue · 11 comments

@magland I am finding mountainsort5 to be excellent but a bit slow for long (2+ hour) recordings with many channels. Is there a reason why the detection and classification in scheme 2 cannot be threaded or parallelized with multiprocessing? If it's simply the case that nobody has had the time to implement this, I will take a stab at it - but I was wondering if there was something I was missing that prevents it from being parallelized.

Hi @jbmelander

That would be great if you could take a stab at parallelizing this!

I should mention that many parts are already multi-threaded by numpy. So I think it would be important to identify the bottleneck steps before endeavoring to do this. I haven't thought too much about this, but I think it may be non-trivial to optimize.

But anything you can do would be great!

@magland Thanks for the feedback. There is at least one place that I think could benefit from spawning multiple processes, but you never know if the effort is futile until you try. I'll give it a shot over the next few days and report back here.
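
To give a sense of what I have in mind, here's a rough sketch of parallelizing per-chunk detection with multiprocessing. The per-chunk function is just a placeholder, not the actual ms5 internals:

```python
import numpy as np
from multiprocessing import Pool

def detect_in_chunk(chunk: np.ndarray) -> np.ndarray:
    # Placeholder for whatever per-chunk work ms5 does internally:
    # return sample indices where any channel crosses a fixed threshold.
    threshold = 5.0
    return np.flatnonzero(np.any(np.abs(chunk) > threshold, axis=1))

def parallel_detect(traces: np.ndarray, chunk_size: int, n_workers: int = 4) -> np.ndarray:
    # Split the (num_samples x num_channels) array into time chunks.
    offsets = list(range(0, traces.shape[0], chunk_size))
    chunks = [traces[o:o + chunk_size] for o in offsets]
    with Pool(n_workers) as pool:
        local_indices = pool.map(detect_in_chunk, chunks)
    # Shift each chunk's local indices back to absolute sample indices.
    return np.concatenate([idx + o for idx, o in zip(local_indices, offsets)])
```

Whether this actually helps probably depends on how much of the time is already spent in numpy's multithreaded routines, as you said.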

Sounds good!

I was able to speed up scheme2 significantly, but in the end it looks like this had less to do with the parallelization and more to do with increasing the chunk size. Might I suggest adding an argument to the scheme2 parameters that lets users set this? I can submit a PR if that helps.

Thanks @jbmelander it would be great if you could submit a PR for that. I think the default should be None, which would use the default calculated chunk size as is now. Also, the name of the parameter should be specific enough to distinguish it from the other durations (it's okay if it's a long parameter name). Feel free to suggest a change with the default as well -- I think this mostly concerns memory usage.
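
For example, something along these lines; the names here are just placeholders, not a final API or the actual Scheme2SortingParameters definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scheme2SortingParametersSketch:
    # ... existing scheme-2 parameters would go here ...
    # Proposed field (placeholder name): if None (the default), keep the chunk size
    # that is currently auto-calculated; otherwise use the user-supplied duration.
    classification_chunk_duration_sec: Optional[float] = None

def resolve_chunk_size(params: Scheme2SortingParametersSketch,
                       sampling_frequency: float,
                       default_chunk_size: int) -> int:
    # Fall back to the auto-calculated default when the user does not override it.
    if params.classification_chunk_duration_sec is None:
        return default_chunk_size
    return int(params.classification_chunk_duration_sec * sampling_frequency)
```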

Sure, I'll submit a PR later today. It's useful for users whose computers have a lot of RAM, and I agree the default should be what is currently set. Thanks! I'm still using MS5 on an experimental basis until I can solve a few more issues, but once I do, I'm happy to revisit the parallelization code, which did contribute some speed improvement (though negligible compared to using more memory).

Sounds great, and don't hesitate to star the repo if you find it useful. :)

:) Definitely. I will say that ms5 seems to give the best sorting results I've seen yet on my data - in terms of identifying real cells. I'm mainly trying to find parameters that don't miss smaller-amplitude cells. On a separate note, do you know of any users who have applied MS4 or 5 to a Neuropixels dataset? I'd love to ask them what kinds of preprocessing and parameters were used. You can email me at melander@stanford.edu if you want to continue this in private.

Glad it's working well for your data! I don't think anyone has been able to use MS4 for Neuropixels data, since the implementation is not suited for very large channel counts. I haven't heard any reports yet about folks using MS5 for NP, although I did test 128-channel subsets of NP datasets during development.
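
For what it's worth, a typical starting point would be the usual spikeinterface chain of bandpass filtering plus whitening before sorting. The cutoffs below are just illustrative defaults, not a tuned Neuropixels recipe:

```python
import numpy as np
import spikeinterface.preprocessing as spre

def preprocess_for_ms5(recording):
    # Bandpass filter then whiten -- a common preprocessing chain before running ms5.
    # Filter cutoffs here are illustrative, not NP-specific tuning.
    recording_filtered = spre.bandpass_filter(
        recording, freq_min=300, freq_max=6000, dtype=np.float32
    )
    return spre.whiten(recording_filtered)
```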

Well, if I have any success I will let you know :)
Can you explain why it isn't suited for large channel counts?

edit: I know this is off-topic. We can close the issue after our next correspondence.

There are some parts of the MS4 implementation (not inherent to the algorithm itself) that scale quadratically with the number of channels. Feel free to open an issue on that repo if you want to get into more detail.
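
As a toy illustration of the scaling (not the actual MS4 code): any step that touches every pair of channels grows with M*(M-1)/2, which is why a full 384-channel Neuropixels probe is so much more costly than a 128-channel subset.

```python
# Toy illustration only -- not the MS4 implementation.
for m in (32, 128, 384):
    print(f"{m} channels -> {m * (m - 1) // 2} channel pairs")
```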