numagic/lumos

casadi backend performance issue

Closed this issue · 7 comments

Description

There still seems to be some performance we could extract from the casadi backend, in particular:

  1. When we run the thread-mapped casadi functions, we don't seem to gain any improvement (the logging of walltime vs processor time also suggests it doesn't parallelize). Why is that happening?

This should work: if we run casadi's parallel example and change it to run with thread parallelism, it does show a performance improvement in line with what one would expect.
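The walltime-vs-processor-time check mentioned above can be sketched as a generic Python diagnostic (this is not lumos code; `busy` is a hypothetical CPU-bound stand-in for one mapped model evaluation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # hypothetical CPU-bound stand-in for one mapped model evaluation
    s = 0
    for i in range(n):
        s += i * i
    return s

def timed(fn):
    # returns (wall time, CPU time); if the work runs in parallel across
    # threads, CPU time should noticeably exceed wall time
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - w0, time.process_time() - c0

wall, cpu = timed(lambda: [busy(100_000) for _ in range(8)])
print(f"serial:  wall={wall:.3f}s cpu={cpu:.3f}s")

with ThreadPoolExecutor(max_workers=4) as ex:
    wall, cpu = timed(lambda: list(ex.map(busy, [100_000] * 8)))
print(f"threads: wall={wall:.3f}s cpu={cpu:.3f}s")
```

Caveat: for pure-Python work like `busy` the GIL prevents any speedup, so CPU time stays close to wall time in both runs; for a casadi thread map (which runs compiled code that releases the GIL), a genuinely parallel run should show CPU time well above wall time.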

  2. The calling of the mapped model functions should be the largest overhead. However, it actually accounts for only a small proportion of the time spent in the model function calls, which indicates that we have a large overhead somewhere in the numpy part (tiling, repeating, collecting inputs and outputs, potentially).

e.g. LTC with 2500 intervals, LGR3: the total time is a factor of ~4 larger than the compiled calls alone (the factor seems to be constant with the number of intervals).

Total model algebra con, jac, hess, including the wrapper:

```
INFO:lumos.optimal_control.nlp:model_algebra.constraints: 0.477527
INFO:lumos.optimal_control.nlp:model_algebra.jacobian: 0.872570
INFO:lumos.optimal_control.nlp:model_algebra.hessian: 0.889361
```

Only the casadi-compiled con, jac, hess:

```
model_algebra.constraints: 0.1592513918876648
model_algebra.jacobian: 0.1893054485321045
model_algebra.hessian: 0.17499386072158812
```

TODO:

  • Reduce the cmap I/O forming overhead.
  • Make multi-threading work (or at least understand why it doesn't).

The code snippet that needs to be added to the casadi example to test `map` on compiled functions:

```python
# Compiled
import subprocess
import time

import numpy as np
from casadi import CodeGenerator, external

# f0, N and dummyInput are defined earlier in the casadi parallel example
# that this snippet is appended to.
codegen = CodeGenerator("gen.c")
codegen.add(f0)
codegen.generate()
# -fPIC is required on most platforms when building a shared library
cmd = ["gcc", "-O2", "-shared", "-fPIC", "-pthread", "gen.c", "-o", "gen.so"]
subprocess.run(cmd, check=True)
fcompiled = external("f", "./gen.so")

for num_workers in np.arange(12) + 1:
    fMap = fcompiled.map(N, "thread", num_workers)
    print(f"evaluating parallel map compiled function with {num_workers} threads...")
    t0 = time.time()
    outMap = fMap(dummyInput)
    t1 = time.time()
    print(
        f"evaluated parallel map compiled function with {num_workers} threads in {t1 - t0} seconds"
    )
```

Breaking down the cost in cmap, especially around the actual mapped function call, we get the following timings:

|                | Constraints | Jacobian | Hessian |
| -------------- | ----------- | -------- | ------- |
| pure call time | 0.18737     | 0.21109  | 0.27124 |
| prep time      | 0.28976     | 0.36234  | 0.34162 |
| post time      | 0.01779     | 0.55019  | 0.59676 |
| total time     | 0.49493     | 1.16590  | 1.20221 |

This seems to show that:

  • for low-dimensional and non-sparse outputs (constraints), input preparation is the main overhead
  • for high-dimensional and sparse outputs (jacobian, hessian), both input prep and post-processing result in large overheads
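The input-prep overhead can be illustrated with a small numpy sketch (illustrative only, not the lumos code; all names and sizes are hypothetical): forming batched inputs with fresh `np.tile`/`np.hstack` allocations on every call, versus writing into a preallocated buffer whose constant part is filled once up front.

```python
import numpy as np

num_intervals, num_vars = 2500, 20  # sizes loosely modeled on the 2500-interval example
params = np.arange(num_vars, dtype=float)  # inputs that are constant across calls

def prep_naive(x):
    # tile the constant parameters across the batch on every call:
    # two fresh allocations per call
    return np.hstack([x, np.tile(params, (x.shape[0], 1))])

buf = np.empty((num_intervals, 2 * num_vars))
buf[:, num_vars:] = params  # constant part written once, up front

def prep_buffered(x):
    # only the varying part is copied per call
    buf[:, :num_vars] = x
    return buf

x = np.ones((num_intervals, num_vars))
assert np.array_equal(prep_naive(x), prep_buffered(x))
```

Timing the two with `timeit` for realistic batch sizes should show the buffered version avoiding most of the per-call allocation cost.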

The nnz computation and the reshape each cost around the same amount of time (for 2500 LGR3, jacobian: 0.26 sec each).

This seems to suggest that the cost is in converting the casadi object to a sparse matrix. This was later confirmed by pre-converting once instead of converting twice.
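The "convert once" idea can be sketched generically (plain numpy stand-in, not the actual casadi/lumos code; `to_triplets` is hypothetical): extract the sparse triplet form a single time and reuse it for both the nnz count and the value vector.

```python
import numpy as np

def to_triplets(dense):
    # stand-in for the expensive dense-to-sparse conversion
    rows, cols = np.nonzero(dense)
    return rows, cols, dense[rows, cols]

dense = np.array([[0.0, 1.5], [2.0, 0.0]])

# before: convert twice, once for nnz and once for the values
nnz = len(to_triplets(dense)[2])
values = to_triplets(dense)[2]

# after: convert once and reuse the result for both
rows, cols, values = to_triplets(dense)
nnz = len(values)

assert nnz == 2 and list(values) == [1.5, 2.0]
```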

The creation of the casadi mapped function seems to be the most significant overhead in 'prep'. But this function mapping is done for a fixed batch size (a bit different from jax), so we need to pass in the correct batch size to start with.
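Since the map is built for a fixed batch size, one way to avoid rebuilding it on every call is to cache one mapped function per batch size. A minimal sketch of the caching idea, using a fake stand-in for a casadi `Function` (the real API differs):

```python
from functools import lru_cache

class FakeFunction:
    """Hypothetical stand-in for a casadi Function (not the real API)."""
    def __init__(self):
        self.maps_built = 0

    def map(self, batch_size):
        # building the map is the expensive step we want to do only once
        self.maps_built += 1
        return lambda xs: [x * 2 for x in xs]

f = FakeFunction()

@lru_cache(maxsize=None)
def get_mapped(batch_size):
    return f.map(batch_size)

for _ in range(100):
    out = get_mapped(2500)([1, 2, 3])

assert out == [2, 4, 6]
assert f.maps_built == 1  # map constructed once, not once per call
```

A new map is still built whenever a previously unseen batch size comes in, which is fine as long as the batch size is fixed for a given problem setup.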

With #27, only the overhead of getting the nonzero elements from a casadi sparse matrix remains.

For 2500 LGR3, before:

```
INFO:lumos.optimal_control.nlp:model_algebra.constraints: 0.583765
INFO:lumos.optimal_control.nlp:model_algebra.jacobian: 1.155053
INFO:lumos.optimal_control.nlp:model_algebra.hessian: 1.242703
```

after:

```
INFO:lumos.optimal_control.nlp:model_algebra.constraints: 0.233584
INFO:lumos.optimal_control.nlp:model_algebra.jacobian: 0.338013
INFO:lumos.optimal_control.nlp:model_algebra.hessian: 0.418324
```

NOTE: changing the mapping method (using "thread" or not, or the number of workers) makes no difference to run time, yet the profiling does seem to show parallelism benefits! Maybe the mapped functions are already parallelized at their core? But this is not supported by casadi's own example script, which shows that only using map with "thread" gives performance benefits.

TODO: double-check whether the multithreading behaves differently between MacOS + conda and a containerized conda env. So far it seems that the containerized conda env does parallelize.