xraypy/xraylarch

Adding numerical jacobian for the AUTOBK process

Closed this issue · 7 comments

@newville

Current xraylarch has a bottleneck in the AUTOBK process, mainly because of the very slow calculation of the Jacobian in FITPACK. It is not an elegant solution, but I succeeded in implementing an analytical gradient in xraytsubaki, which gave a 3-4x speedup.

I attached the profiling of the numerical and analytical gradients done in xraytsubaki below:
https://github.com/Ameyanagi/xraytsubaki/blob/main/doc/profiling.md
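
On the xraylarch side, the same hotspot can be checked with cProfile. A minimal sketch, assuming a Larch Group `dat` that already has `energy` and `mu` arrays loaded (this is not the profiling from the link above):

```python
# Sketch of profiling autobk on the xraylarch side with cProfile;
# "dat" is assumed to be a Larch Group with .energy and .mu loaded.
import cProfile
import pstats

from larch.xafs import autobk

cProfile.run("autobk(dat, rbkg=1.0)", "autobk.prof")
pstats.Stats("autobk.prof").sort_stats("cumulative").print_stats(15)
# look for the fitpack spline routines and leastsq near the top of the listing
```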

I can implement the same logic in xraylarch, but I have one problem: scipy's leastsq doesn't accept a custom Jacobian, so we would need to switch to least_squares, which is the newer interface. Is there any problem with switching the FITPACK wrapper?

And should I try implementing the analytical Jacobian in xraylarch?

The two main pieces of code are the following:
https://github.com/Ameyanagi/xraytsubaki/blob/main/src/xafs/background.rs#L689
https://github.com/Ameyanagi/xraytsubaki/blob/main/src/xafs/mathutils.rs#L386

@Ameyanagi That's all really interesting and worth pursuing.

I probably would have done a couple of things. First, I might try switching from lmfit to plain scipy.optimize.leastsq; for autobk we are not using any of the lmfit features, so that could have some impact, but probably not a factor of 2. Second, I might play with the tolerances and epsfcn values to see if the number of iterations can be reduced.
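
For reference, a minimal generic example of those knobs (this is not the autobk call itself, just scipy.optimize.leastsq with a toy model):

```python
# Toy example of the leastsq tolerance and step-size arguments mentioned
# above (ftol, xtol, epsfcn); the model and data are placeholders, not
# anything from autobk.
import numpy as np
from scipy.optimize import leastsq

xdat = np.linspace(0, 10, 101)
ydat = 2.5 * xdat + 1.0 + 0.05 * np.random.default_rng(1).normal(size=xdat.size)

def residual(pars, x, y):
    slope, offset = pars
    return slope * x + offset - y

best, ier = leastsq(
    residual, [1.0, 0.0], args=(xdat, ydat),
    ftol=1.e-6,    # relative tolerance on the sum of squares
    xtol=1.e-6,    # relative tolerance on the parameter values
    epsfcn=1.e-8,  # sets the step for the forward-difference Jacobian
    maxfev=2000,   # cap on the number of function evaluations
)
```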

scipy.optimize.leastsq can take an analytic Jacobian - its Dfun argument. And, sure, if derivatives can be done analytically, it is worth pursuing.
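
To make the interface concrete, a minimal sketch of passing an analytic Jacobian through Dfun (a toy exponential model, standing in for the real autobk background function):

```python
# Toy example of supplying an analytic Jacobian to scipy.optimize.leastsq
# via its Dfun argument.
import numpy as np
from scipy.optimize import leastsq

xdat = np.linspace(0, 10, 201)
ydat = 3.0 * np.exp(-0.7 * xdat) + 0.01 * np.random.default_rng(0).normal(size=xdat.size)

def residual(pars, x, y):
    amp, rate = pars
    return amp * np.exp(-rate * x) - y

def jacobian(pars, x, y):
    # one row per residual, one column per parameter (col_deriv=0, the default)
    amp, rate = pars
    d_amp = np.exp(-rate * x)
    d_rate = -amp * x * np.exp(-rate * x)
    return np.column_stack([d_amp, d_rate])

best, ier = leastsq(residual, [1.0, 1.0], args=(xdat, ydat), Dfun=jacobian)
```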

FWIW, I would not expect least_squares to be faster than leastsq. But I think switching between the two would not be a big effort compared to implementing the Jacobian in the first place.

@newville

I just realized that I misunderstood the arguments of leastsq. It has a Dfun argument that can be used to pass in the Jacobian.

I will first try this argument and implement the analytical Jacobian.

@Ameyanagi I just pushed some changes for autobk that give a speedup of close to 2x.

Switching to plain scipy.optimize.leastsq gives maybe a 50% increase, which seemed worth doing.

I also changed the default to calc_uncertainties=False so that the arrays for delta_chi(k) and delta_bkg(E) are not calculated on every run of autobk. That calculation is fairly slow, so it was moved to a separate function (autobk_delta_chi) that can be called for any group that has had autobk run. Initializing a Feffit Dataset will call that function if needed.
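
If I read that correctly, the intended usage looks roughly like the sketch below (names follow the description above; the exact signatures in the code may differ):

```python
# Sketch of the workflow described above: run autobk without the
# uncertainty arrays, and compute delta_chi / delta_bkg only when needed.
# The file name and exact signatures here are illustrative.
from larch.io import read_ascii
from larch.xafs import autobk, autobk_delta_chi, pre_edge

dat = read_ascii("fe_xafs.dat", labels="energy mu")   # hypothetical data file
pre_edge(dat)
autobk(dat, rbkg=1.0, calc_uncertainties=False)       # fast path

# later, for a group that has had autobk run:
autobk_delta_chi(dat)   # fills the delta_chi(k) and delta_bkg(E) arrays
```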

There was also a small improvement made in checking for duplicate energies in remove_dups().

For me, these give almost a factor of 2. That should put running autobk at around 10 milliseconds on many modern-ish laptops. I am not aware of anyone collecting 1,000 spectra in a minute at any sustained rate. If someone is looking to collect data at that rate, they are going to have to consider processing and storage. A 10-node machine could process 1000 spectra per second. I kind of think we're optimized enough for a generation ;).

@newville

Thank you for the nice changes. I was working on implementing the analytical Jacobian, but so far I am not getting a significant speedup. That is probably because I would need to rewrite the core FITPACK routines in Python, and those involve so many for loops that the slowdown from the loops outweighs the gain from the analytical Jacobian.

I could technically write the Jacobian in a compiled language and expose an API, but I think that would be a bad idea in terms of maintenance.
I don't think we can optimize much further at this point.

BTW, Python will eventually remove the GIL, which currently prevents us from running multithreaded code. Maybe we could then add an autobk_par to speed up heavy AUTOBK workloads.

@Ameyanagi Yes, I sort of think that analytic derivatives here would be hard to maintain, even if they were shown to be a significant speed-up.

I think removing the GIL will be good in general, but probably not for Autobk (at least in the short term). Basically, non-thread-safe Fortran code is repeatedly calling a Python function to get a residual array and using that for the minimization. Most of the numerical work is in Fortran, even though the runtime is dominated by calling that one Python objective function (which is itself compute-bound by fitpack, not I/O bound). Removing the GIL won't help in that situation.

What could help would be to re-do the LM solver (say, using C) to be thread-safe and to then call the Python objective function in parallel (Nvary times, each in its own C thread and CPU core) when calculating numerical derivatives. A main process would launch Nvary threads, wait for them to finish, create the Jacobian, use that to predict the next step, and repeat until stable. That could dramatically speed up all fits with more than 2 or 3 parameters.
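
In pure Python the same structure would look roughly like the sketch below. It is only illustrative: with the GIL and a Python objective function the threads will not actually run concurrently, which is exactly why the orchestration would need to live in a thread-safe C solver.

```python
# Structural sketch only: build the forward-difference Jacobian with one
# worker per varied parameter.  A real speedup needs the thread-safe C
# solver described above (or process-based workers), not Python threads.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def fd_jacobian(objective, params, eps=1.e-8):
    params = np.asarray(params, dtype=float)
    f0 = np.asarray(objective(params))

    def one_column(i):
        step = np.sqrt(eps) * max(abs(params[i]), 1.0)
        p = params.copy()
        p[i] += step
        return (np.asarray(objective(p)) - f0) / step

    # Nvary workers, one per varied parameter
    with ThreadPoolExecutor(max_workers=len(params)) as pool:
        columns = list(pool.map(one_column, range(len(params))))
    return np.column_stack(columns)   # shape: (len(residual), Nvary)
```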

I think that the thread-safe solver in C does exist (https://github.com/devernay/cminpack), but having a main C loop that calls the user-supplied Python objective function in parallel threads and then assembles the results would be real work. ;)

But also, I really believe that 0.01 s for Autobk is fast enough for the next decade. Splitting jobs across 10 cores gives you 1000 spectra processed per second. When that is too slow, going to 50 faster CPUs could get you another order of magnitude.
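
The "split across cores" part is already easy today with process-based workers, which sidestep the GIL entirely. A rough sketch, assuming each group pickles cleanly:

```python
# Rough sketch of batch-parallel background removal with processes;
# assumes each Larch Group pickles cleanly and autobk is importable
# as in current xraylarch.
from concurrent.futures import ProcessPoolExecutor

from larch.xafs import autobk

def run_autobk(group):
    autobk(group, rbkg=1.0, calc_uncertainties=False)
    return group

def process_batch(groups, nworkers=10):
    with ProcessPoolExecutor(max_workers=nworkers) as pool:
        return list(pool.map(run_autobk, groups))
```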

FWIW, I'm in the process of adding "autobk-style background" to Feffit.

@newville

I agree that it might not be good in terms of thread safety for autobk. I think it is better to keep the current code.
One of the reasons I was working in Rust is its concept of "fearless concurrency," which makes it very easy to write thread-safe code and parallelize it.

I will close this issue.

@Ameyanagi Just to be clear, I don't mean that using analytic Jacobians causes thread-safety problems. It's sort of worse than that: MINPACK / lmdif / scipy.optimize.leastsq are not thread-safe; they are single-threaded.

One could imagine building the finite-difference Jacobian by doing Nvarys simultaneous calculations, in C (not Fortran) and with thread safety in mind (and such code does exist now!). That could distribute most of the function calls to up to Nvarys threads/CPU cores, which could easily give 5 to 10x speedups.

Having that call objective functions written in Python is another problem. You either have to work hard for picklability (that's unlikely for lmfit with constraints, but might work for autobk), or you have to go multi-interpreter - maybe the new sub-interpreters could work.

This would be real work - but it also might be possible (which I might have been less optimistic about in the past). And it has the potential to really speed up all Python fits by 5 to 10x.