Unexpected behavior of NumbaLSODA and nb.cfunc in parallel mode
amar-iastate opened this issue · 2 comments
Hi there,
Thanks for sharing this package. I am seeing up to a 20x speedup in my code. Since I am simulating the evolution of multiple initial conditions, I wanted to speed the code up further by using the parallel=True flag in Numba, so I adapted the example you provided on Stack Overflow to use NumbaLSODA with parallel=True. However, I observed that the speedup was marginal even with 8 cores, and in some cases my parallel code was actually slower (!).
After debugging the two-state example with the code below, I narrowed the cause down to how the parameters of the rhs function are defined. If the parameters are either global variables or passed via the data argument of lsoda, then parallel=True speeds up the code almost linearly with the number of cores. However, if the parameters are defined locally, as in f_local below, the speedup is marginal. In fact, the parallel version using f_local is slower than the serial version using f_global or f_param.
I thought that declaring variables locally was good coding practice and would speed up execution, but that does not seem to be the case with Numba and NumbaLSODA. I do not know whether this behavior is caused by Numba's cfunc or by NumbaLSODA; it was definitely unexpected, so I thought I would bring it to your notice. Do you know the reason for it? Are local constants not compiled into the function the way global constants are? I was unable to dig deeper into the cause and was wondering if you could help.
Best
Amar
from NumbaLSODA import lsoda_sig, lsoda
from matplotlib import pyplot as plt
import numpy as np
import numba as nb
import time
a_glob = np.array([1.5, 1.5])

@nb.cfunc(lsoda_sig)
def f_global(t, u_, du, p):  # the parameter array a_glob is a global
    u = nb.carray(u_, (2,))
    du[0] = u[0] - u[0]*u[1]*a_glob[0]
    du[1] = u[0]*u[1] - u[1]*a_glob[1]

@nb.cfunc(lsoda_sig)
def f_local(t, u_, du, p):  # the parameter array a is a local
    u = nb.carray(u_, (2,))
    a = np.array([1.5, 1.5])
    du[0] = u[0] - u[0]*u[1]*a[0]
    du[1] = u[0]*u[1] - u[1]*a[1]

@nb.cfunc(lsoda_sig)
def f_param(t, u_, du, p):  # a is passed in as the parameter p
    u = nb.carray(u_, (2,))
    du[0] = u[0] - u[0]*u[1]*p[0]
    du[1] = u[0]*u[1] - u[1]*p[1]

funcptr_glob = f_global.address
funcptr_local = f_local.address
funcptr_param = f_param.address

t_eval = np.linspace(0.0, 20.0, 201)
np.random.seed(0)
a = np.array([1.5, 1.5])
@nb.njit(parallel=True)
def main(n, flag):
    # a = np.array([1.5,1.5])
    u1 = np.empty((n, len(t_eval)), np.float64)
    u2 = np.empty((n, len(t_eval)), np.float64)
    for i in nb.prange(n):
        u0 = np.empty((2,), np.float64)
        u0[0] = np.random.uniform(4.5, 5.5)
        u0[1] = np.random.uniform(0.7, 0.9)
        if flag == 1:  # global
            usol, success = lsoda(funcptr_glob, u0, t_eval, rtol=1e-8, atol=1e-8)
        if flag == 2:  # local
            usol, success = lsoda(funcptr_local, u0, t_eval, rtol=1e-8, atol=1e-8)
        if flag == 3:  # param
            usol, success = lsoda(funcptr_param, u0, t_eval, data=a, rtol=1e-8, atol=1e-8)
        u1[i] = usol[:, 0]
        u2[i] = usol[:, 1]
    return u1, u2
@nb.njit(parallel=False)
def main_series(n, flag):  # same function as above but with parallel=False
    # a = np.array([1.5,1.5])
    u1 = np.empty((n, len(t_eval)), np.float64)
    u2 = np.empty((n, len(t_eval)), np.float64)
    for i in nb.prange(n):
        u0 = np.empty((2,), np.float64)
        u0[0] = np.random.uniform(4.5, 5.5)
        u0[1] = np.random.uniform(0.7, 0.9)
        if flag == 1:  # global
            usol, success = lsoda(funcptr_glob, u0, t_eval, rtol=1e-8, atol=1e-8)
        if flag == 2:  # local
            usol, success = lsoda(funcptr_local, u0, t_eval, rtol=1e-8, atol=1e-8)
        if flag == 3:  # param
            usol, success = lsoda(funcptr_param, u0, t_eval, data=a, rtol=1e-8, atol=1e-8)
        u1[i] = usol[:, 0]
        u2[i] = usol[:, 1]
    return u1, u2
n = 100
# call each variant once first to trigger JIT compilation
u1, u2 = main(n, 1)
u1, u2 = main(n, 2)
u1, u2 = main(n, 3)
u1, u2 = main_series(n, 1)
u1, u2 = main_series(n, 2)
u1, u2 = main_series(n, 3)
# time each variant for a large number of initial conditions
n = 10000
a1 = time.time()
u1, u2 = main(n,1) # global
a2 = time.time()
print(a2 - a1) # this is fast
a1 = time.time()
u1, u2 = main(n,2) # local
a2 = time.time()
print(a2 - a1) # this is slow
a1 = time.time()
u1, u2 = main(n,3) # param
a2 = time.time()
print(a2 - a1) # this is fast and almost identical performance as global
a1 = time.time()
u1, u2 = main_series(n,1) # global
a2 = time.time()
print(a2 - a1) # this is faster than local + parallel
a1 = time.time()
u1, u2 = main_series(n,2) # local
a2 = time.time()
print(a2 - a1) # this is slow
a1 = time.time()
u1, u2 = main_series(n,3) # param
a2 = time.time()
print(a2 - a1) # this is fast and almost identical performance as global
The "local" cases are probably slower because the array a = np.array([1.5,1.5])
must be allocated every time the rhs function is called.
The "global" and "param" cases are faster because they have no allocations at all during integration.
For constants in your rhs function, it's normally a good idea to do something like this:
def make_cfunc():
    a = np.array([1.5, 1.5])

    @nb.cfunc(lsoda_sig)
    def f(t, u_, du, p):
        u = nb.carray(u_, (2,))
        du[0] = u[0] - u[0]*u[1]*a[0]
        du[1] = u[0]*u[1] - u[1]*a[1]

    return f
In this case, a will be treated as a constant and "baked" into the rhs function. No allocations occur during integration, but you still don't have an annoying global variable in the scope of the other parts of your code. This approach only works for constants. Alternatively, if a changes a bunch of times over the course of many integrations, you should pass a in as a parameter via p.
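For concreteness, here is a minimal sketch of how the closure version plugs into lsoda (reusing the t_eval defined above; the initial condition values are illustrative):

f = make_cfunc()
funcptr = f.address  # pass this pointer to lsoda, just like f_global.address above

u0 = np.array([5.0, 0.8])  # illustrative initial condition
usol, success = lsoda(funcptr, u0, t_eval, rtol=1e-8, atol=1e-8)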
Thanks for this clarification. I was under the impression that the Numba LLVM compiler would compile the constant line a = np.array([1.5,1.5]) so that it would not need a new allocation each time the function is called. It seems I was mistaken. The make_cfunc() approach above makes sense and I will use it in my code.