Computational efficiency
Over the weekend I ran a bunch of whole-body registrations using cross-correlation, and the results look a lot better than SSD, so we should probably consider using it going forward. The problem is that our implementation of NCC is far too slow: it takes around 60-70 minutes to register a single subject on my machine.
The problem, IMHO, is that the current design of the registration engine, while very simple and maintainable, is also very inefficient from a computational perspective. One of the most obvious issues is having a function call to compute the energy of every single voxel. Apart from the overhead of the calls themselves, it breaks vectorisation and prevents many elementary optimisations. I was thinking that maybe we should refactor the cost functions to compute the cost over a whole block in a single call. Especially for NCC, that would allow us to drop the naïve algorithm and implement a more optimised version that avoids redundant calculations; the speedup could be huge.
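To make the block idea a bit more concrete, here is a minimal sketch of what a block-wise interface could look like. All the names here (BlockCostFunction, cost_block, int3/float3) are made up for illustration, not the engine's actual API; the only point is "one call per block and candidate displacement instead of one call per voxel".

```cpp
// Hypothetical sketch of a block-wise cost function interface; the names are
// invented for illustration and are not the actual engine API.
#include <vector>

struct int3   { int x, y, z; };
struct float3 { float x, y, z; };

class BlockCostFunction {
public:
    virtual ~BlockCostFunction() = default;

    // Evaluate the unary cost of every voxel in a block, for one candidate
    // displacement, in a single call: one virtual dispatch per block instead
    // of one per voxel, and the loop over voxels stays inside the cost
    // function where it can be vectorised.
    virtual void cost_block(const int3& block_begin,    // block origin (voxels)
                            const int3& block_dims,     // block size (voxels)
                            const float3& delta,        // candidate displacement
                            std::vector<double>& costs  // one value per voxel, row-major
                            ) const = 0;
};
```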
Another thing I had in mind, even if I need to think a bit more about it, is that since the displacement is discrete, it may be possible to resample the images onto a common grid before registration, one that accounts for spacing and step size. It would take some extra memory, but it would also save a lot of repeated calculations later.
Yeah, NCC is a lot better, but the benefit/cost hasn't been fully investigated. The cost functions are definitely our bottleneck (tied for first place with maxflow), but given that our method is faster than other methods that use NCC, I think we aren't doing too badly.
I was very skeptical of my model for cost functions at first, given that we make multiple virtual calls for each voxel and have no respect for the cache at all. But I attempted several approaches (more cache-efficient, avoiding vtables, etc.), and this was the cleanest, and I couldn't really see any extra cost.
Currently the biggest culprit is linear_at (>50% of runtime last time I checked). I think the cost of vtable lookups, etc., is negligible. The memory access pattern of linear_at in its current form is really bad for cache efficiency. I've tried optimizing what I can, and it should be fully vectorized on Windows.
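For reference, this is roughly what a trilinear linear_at boils down to (a simplified sketch, not our actual implementation): the pairs along x are adjacent in memory, but the y neighbors are a whole row apart and the z neighbors a whole slice apart, so the eight loads scatter across four cache-line neighborhoods per sample.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Simplified volume wrapper used by the sketches in this thread; clamp-to-edge
// indexing keeps border samples in bounds.
struct Volume {
    const float* data; // contiguous, x fastest, then y, then z
    int w, h, d;
    float at(int x, int y, int z) const {
        x = std::clamp(x, 0, w - 1);
        y = std::clamp(y, 0, h - 1);
        z = std::clamp(z, 0, d - 1);
        return data[((std::size_t)z * h + y) * w + x];
    }
};

// Rough stand-in for linear_at: standard trilinear interpolation. Each call
// touches two rows in each of two slices, which is what hurts the cache once
// sample positions jump around.
inline float linear_at_sketch(const Volume& v, float x, float y, float z)
{
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y), z0 = (int)std::floor(z);
    float fx = x - x0, fy = y - y0, fz = z - z0;

    float c00 = v.at(x0, y0,     z0    ) * (1 - fx) + v.at(x0 + 1, y0,     z0    ) * fx;
    float c10 = v.at(x0, y0 + 1, z0    ) * (1 - fx) + v.at(x0 + 1, y0 + 1, z0    ) * fx;
    float c01 = v.at(x0, y0,     z0 + 1) * (1 - fx) + v.at(x0 + 1, y0,     z0 + 1) * fx;
    float c11 = v.at(x0, y0 + 1, z0 + 1) * (1 - fx) + v.at(x0 + 1, y0 + 1, z0 + 1) * fx;

    float c0 = c00 * (1 - fy) + c10 * fy;
    float c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}
```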
I made some attempts to optimize NCC by computing it block by block, but the problem is that I couldn't find any real overlap of work between voxels (at least not in terms of sampling). Since the result can't depend on neighbouring displacements (NCC assumes local rigidity), any samples obtained for one voxel can't be reused for another. Otherwise this would have been a great way to reduce the number of linear_at calls significantly.
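Just to spell out why the samples can't be shared, here is roughly what the naive per-voxel evaluation looks like (a simplified sketch reusing the Volume / linear_at_sketch / int3 / float3 types from above, not our actual cost function, and ignoring spacing for brevity): every moving-image sample is taken at p + offset + delta for this voxel's own delta, so a neighbour testing a different delta needs an entirely different set of interpolated values.

```cpp
#include <cmath>

// Naive NCC over a (2r+1)^3 window, evaluated for a single voxel p and a
// single candidate displacement delta (in voxel units here, ignoring spacing).
// Reuses Volume, int3, float3 and linear_at_sketch from the sketches above.
inline double ncc_cost_sketch(const Volume& fixed, const Volume& moving,
                              const int3& p, const float3& delta, int radius)
{
    double sf = 0, sm = 0, sff = 0, smm = 0, sfm = 0;
    int n = 0;
    for (int dz = -radius; dz <= radius; ++dz)
    for (int dy = -radius; dy <= radius; ++dy)
    for (int dx = -radius; dx <= radius; ++dx) {
        float f = fixed.at(p.x + dx, p.y + dy, p.z + dz);
        // Every sample is interpolated at p + offset + delta, i.e. it depends
        // on this voxel's candidate displacement, so nothing here can be
        // reused for a neighbouring voxel testing a different delta.
        float m = linear_at_sketch(moving,
                                   p.x + dx + delta.x,
                                   p.y + dy + delta.y,
                                   p.z + dz + delta.z);
        sf += f; sm += m;
        sff += (double)f * f; smm += (double)m * m; sfm += (double)f * m;
        ++n;
    }
    double cov   = sfm - sf * sm / n;
    double var_f = sff - sf * sf / n;
    double var_m = smm - sm * sm / n;
    if (var_f <= 0.0 || var_m <= 0.0)
        return 0.0; // flat patch, no correlation information
    return 1.0 - cov / std::sqrt(var_f * var_m); // lower is better
}
```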
I think the next step is to actually move to the GPU; there we would get really cheap cost functions and all the cache problems would be gone. I've been investigating graph cut on the GPU over the summer, so if I have the time I will probably attempt to implement something.
If you see any possible improvements, you are of course free to try them out. Per-block cost functions are quite simple to implement; the troublesome part is that you probably need some intermediate storage between the cost functions and the graph-cut interface, unless you want to pass that directly to the cost function, which would probably end in a mess.
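The intermediate storage wouldn't have to be fancy; something along these lines would probably do (purely a hypothetical sketch, not a design proposal): the block-wise cost functions fill a per-voxel, per-label cost buffer, and the graph-cut setup only reads from it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical unary-cost buffer sitting between block-wise cost functions and
// the graph-cut interface: cost functions write into it block by block, the
// graph construction reads from it voxel by voxel.
struct UnaryCostBuffer {
    int w, h, d;       // volume dimensions (voxels)
    int num_labels;    // candidate displacements per voxel (e.g. 2 for keep/move)
    std::vector<double> costs;

    UnaryCostBuffer(int w_, int h_, int d_, int num_labels_)
        : w(w_), h(h_), d(d_), num_labels(num_labels_),
          costs((std::size_t)w_ * h_ * d_ * num_labels_, 0.0) {}

    double& at(int x, int y, int z, int label) {
        return costs[(((std::size_t)z * h + y) * w + x) * num_labels + label];
    }
};
```

The obvious price is the extra w·h·d·num_labels values of memory, so it's a trade-off rather than a free win.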
That's definitely a possibility. ITK does something like that if I remember correctly.
Thanks for the details; it helps to see your point of view.
Definitely, the difference in quality I got was significant. With SSD the alignment of internal organs is almost completely random...
Yes, it is tricky business for sure. All the constraints due to the graph cut and the handling of image coordinates are so annoying. Over the weekend I was considering the idea of decoupling the resampling from the NCC calculations, exploiting the fact that the displacement is discrete: resample the relevant portion of the image once onto an isotropic grid whose spacing is the step length, thereby replacing linear_at with a plain array access in the subsequent calculations. But the balance may depend on the ratio between spacing and step length; for small step lengths it may be unfeasible in terms of memory, or even make things slower. The problem is that at the moment I am working on something like four different things in parallel, so I haven't had much time to think about it lucidly and see whether it is feasible or whether I am just hallucinating. (Probably I am just hallucinating.)
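In case it helps to judge feasibility, here is a very rough sketch of what I mean (hypothetical code, reusing the Volume / linear_at_sketch types from your sketch above, and assuming an isotropic moving image): resample the image once onto a grid whose spacing equals the step length, after which a candidate displacement of (i, j, k) steps becomes a plain integer index offset. Memory grows roughly with (spacing / step)³, which is where it probably breaks down for small step lengths.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Pre-resampled copy of the moving image on a grid with spacing == step length.
struct StepGridVolume {
    int w, h, d;
    std::vector<float> data;
    float at(int x, int y, int z) const {
        x = std::clamp(x, 0, w - 1);
        y = std::clamp(y, 0, h - 1);
        z = std::clamp(z, 0, d - 1);
        return data[((std::size_t)z * h + y) * w + x];
    }
};

// One-off resampling: pays the interpolation cost once, up front.
// `spacing` is the moving image's (isotropic) voxel spacing in mm,
// `step` the optimiser's step length in mm.
inline StepGridVolume resample_to_step_grid(const Volume& moving,
                                            float spacing, float step)
{
    float scale = spacing / step; // output samples per input voxel, per axis
    StepGridVolume out;
    out.w = (int)(moving.w * scale);
    out.h = (int)(moving.h * scale);
    out.d = (int)(moving.d * scale);
    out.data.resize((std::size_t)out.w * out.h * out.d);

    for (int z = 0; z < out.d; ++z)
    for (int y = 0; y < out.h; ++y)
    for (int x = 0; x < out.w; ++x)
        out.data[((std::size_t)z * out.h + y) * out.w + x] =
            linear_at_sketch(moving, x / scale, y / scale, z / scale);
    return out;
}

// In the cost evaluation, a displacement of (i, j, k) steps is then just
//   resampled.at(x0 + i, y0 + j, z0 + k)
// with no interpolation in the inner loop.
```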
Sure, moving to GPGPU would help a lot; that's also why I wasn't sure whether it was worth spending time even thinking about this.
Really... That's interesting. Then we should definitely use NCC.
As you mentioned, I think this would quickly get out of hand in terms of memory. You would also need to introduce a lot of special cases for constraints or initial deformations that aren't guaranteed to stick to the grid. I have been thinking a bit about a specialized linear_at for NCC, as we could probably be a bit more cache-friendly there.
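One possible angle for that (just a sketch, and it assumes the fixed and moving volumes share spacing and orientation, which is not true in general): every sample in the NCC window sits at an integer offset from p + delta, so the trilinear weights are the same for the whole window, and a full row of the window can be interpolated with the loads walking contiguously along x instead of scattering per sample.

```cpp
// Interpolates `count` consecutive samples along x that share the same
// fractional offset (fx, fy, fz). Hypothetical helper reusing Volume from the
// sketch above; only valid when fixed and moving grids share spacing, since
// only then does the whole NCC window share one fractional offset.
inline void interpolate_window_row(const Volume& v,
                                   int x0, int y0, int z0,
                                   float fx, float fy, float fz,
                                   int count, float* out)
{
    for (int i = 0; i < count; ++i) {
        // Same eight-neighbor blend as linear_at, but consecutive iterations
        // read consecutive x positions from the same four (y, z) rows, so the
        // accesses stream through cache lines instead of jumping around.
        float c00 = v.at(x0+i, y0,   z0  ) * (1-fx) + v.at(x0+i+1, y0,   z0  ) * fx;
        float c10 = v.at(x0+i, y0+1, z0  ) * (1-fx) + v.at(x0+i+1, y0+1, z0  ) * fx;
        float c01 = v.at(x0+i, y0,   z0+1) * (1-fx) + v.at(x0+i+1, y0,   z0+1) * fx;
        float c11 = v.at(x0+i, y0+1, z0+1) * (1-fx) + v.at(x0+i+1, y0+1, z0+1) * fx;
        out[i] = (c00 * (1-fy) + c10 * fy) * (1-fz) + (c01 * (1-fy) + c11 * fy) * fz;
    }
}
```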
Haha, that's understandable, I feel the same. I've had a hard time accepting the runtime, so I've put a lot of effort into reducing it. I think the runtime has come down by 60% or so since our first version. I definitely want to reduce it further, since all time spent waiting for results is time wasted.
I think the most time-efficient investment right now is to go GPU, so that's my focus for now. It is also the best step forward research-wise, because I need more publications. That said, speed-ups are always welcome, and the CPU implementation will always be there; we just need to be a bit careful with how much effort we put into it.
That seems like a sound plan. I still have to define what to work on in the coming period; I am doing a bunch of things now, but without any long-term plan. I will keep this in mind when deciding, though, since optimising the runtime and moving to CUDA was also supposed to be one of the tasks in my study plan. So we can probably close this for now; I may reopen it if a more concrete idea comes up.