jeff-regier/Celeste.jl

Investigate hotspot loops

Closed this issue · 5 comments

Keno commented

The following are the non-vectorized hot spots that could benefit from loop vectorization or adjustments to their memory access patterns. For each of these, we need to figure out what the primary bottleneck is (memory bandwidth, cache locality, IPC throughput, etc.):

  • calculate_var_G_s!
  • calculate_source_pixel_brightness!
  • combine_sfs_hessian!
  • fill!
  • first_quad_form!
  • add_sources_sf!
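As a rough illustration of the kind of memory-access adjustment meant here, the sketch below sums a matrix twice: once with the inner loop walking down a column (contiguous in Julia's column-major layout, and eligible for `@simd`), and once with the inner loop walking along a row (strided). The functions `sum_colmajor` and `sum_rowmajor` are hypothetical and are not part of Celeste; whether any of the listed hot spots actually has this problem is exactly what still needs to be measured.

```julia
# Inner loop runs down a column, so consecutive iterations touch
# adjacent memory and the loop can be vectorized.
function sum_colmajor(A::Matrix{Float64})
    s = 0.0
    @inbounds for j in axes(A, 2)
        @simd for i in axes(A, 1)
            s += A[i, j]
        end
    end
    return s
end

# Inner loop runs along a row, so consecutive iterations are separated
# by a stride equal to the number of rows.
function sum_rowmajor(A::Matrix{Float64})
    s = 0.0
    @inbounds for i in axes(A, 1)
        for j in axes(A, 2)
            s += A[i, j]
        end
    end
    return s
end

A = rand(2_000, 2_000)
sum_colmajor(A); sum_rowmajor(A)      # warm-up calls so compilation is not timed
t_col = @elapsed sum_colmajor(A)
t_row = @elapsed sum_rowmajor(A)
println("column-wise inner loop: ", t_col, " s")
println("row-wise inner loop:    ", t_row, " s")
```

The two timings will vary by machine; the point is only that the same arithmetic can run at different speeds depending on traversal order.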

The first thing to do is to come up with a benchmark harness that can call each of these in isolation, so we can investigate with performance counters/track performance changes for these.

Thanks for the hotspot list. However, these are mainly associated with Hessian computations, which Jarrett and Jeff are trying to remove right now, so we'll need to make a new hotspot list once their changes are ready.

Keno commented

Yes, we will. However, I think we should go ahead and start optimizing these, because most of these will have similar problems, so we should start finding out what those are.

The cg2 branch works, but it's 4x slower while only having 3x as many floating point operations. I think it's better to leave it unmerged until after GB. The cg2 branch needs some algorithmic work ("preconditioning") to be ready.

I'll add a new benchmark (benchmark_elbo_likelihood.jl) that profiles a typical evaluation of the elbo_likelihood function. The vast majority of runtime for a real execution of Celeste should be in elbo_likelihood (and the functions it calls), including all the hotspots identified in this issue.

Keno commented

That's not quite what we need. We do actually want a different benchmark (it can be in the same file) that calls each of these hot spot functions a few thousand times with made-up data. The purpose of that is to amplify the performance differences that result from our changes to each hot spot.
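A minimal sketch of such a per-function harness, using only Base Julia, might look like the following. Everything in it is an assumption made for illustration: `fake_quad_form!` is a made-up stand-in rather than the real `first_quad_form!` (whose arguments are not shown in this thread), and `time_in_isolation` is a hypothetical helper name.

```julia
# Hypothetical stand-in for one hot spot. The real Celeste functions take
# model-specific arguments; this computes out[j] = (A' * x)[j] * x[j].
function fake_quad_form!(out::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    n = length(x)
    @inbounds for j in 1:n
        acc = 0.0
        @simd for i in 1:n
            acc += A[i, j] * x[i]
        end
        out[j] = acc * x[j]
    end
    return out
end

# Call `f(args...)` `reps` times and return the mean seconds per call.
function time_in_isolation(f, args...; reps::Int = 5_000)
    f(args...)                        # warm-up call so compilation is not timed
    elapsed = @elapsed for _ in 1:reps
        f(args...)
    end
    return elapsed / reps
end

# Made-up input data, as suggested above.
n = 64
A = rand(n, n)
x = rand(n)
out = zeros(n)

per_call = time_in_isolation(fake_quad_form!, out, A, x)
println("fake_quad_form!: ", per_call * 1e9, " ns per call")
```

Repeating the call many times averages out timer noise, so a small change inside one hot spot shows up as a measurable difference. A package such as BenchmarkTools.jl could replace the hand-rolled timing loop if more careful statistics are wanted.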