conda create --name gpu_stack python=3.8 -y
conda activate gpu_stack
conda install ipykernel jupyter nb_conda_kernels pandas numba cudatoolkit
conda install -c conda-forge cupy cudnn cutensor nccl
conda install -c conda-forge jupyter_contrib_nbextensions
conda install -c conda-forge jupyter_nbextensions_configurator
conda install tbb
conda install -c numba icc_rt
pip install RISE
- JIT takes more time the first time it compiles a function. (Done)
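
  A quick way to see that one-off compile cost (the function and array size below are illustrative): time the first call of a jitted function against the second.

  ```python
  from time import perf_counter

  import numpy as np
  from numba import njit

  @njit
  def total(arr):
      s = 0.0
      for i in range(arr.shape[0]):
          s += arr[i]
      return s

  x = np.random.rand(1_000_000)
  t0 = perf_counter(); total(x); t1 = perf_counter()   # first call includes JIT compilation
  t2 = perf_counter(); total(x); t3 = perf_counter()   # second call is already compiled
  print(f"first call: {t1 - t0:.4f}s, second call: {t3 - t2:.4f}s")
  ```
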
- If things are running in object mode, code written with Numba will take more time than pure Python. (Done)
- `@njit` and `@jit(nopython=True)` are the same. Always use `@njit` to ensure performant code.
- Use NumPy arrays, not Python lists, with Numba.
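
  A minimal sketch of both points above (names are illustrative): decorate a plain loop with `@njit` and feed it a NumPy array rather than a list.

  ```python
  import numpy as np
  from numba import njit

  @njit  # identical to @jit(nopython=True)
  def sum_of_squares(arr):
      acc = 0.0
      for i in range(arr.shape[0]):
          acc += arr[i] * arr[i]
      return acc

  x = np.arange(1_000_000, dtype=np.float64)  # a NumPy array, not a Python list
  print(sum_of_squares(x))
  ```
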
- Whenever possible, use `@vectorize`. Write the function as a scalar function and decorate it with `@vectorize`; it will then work for both scalars and arrays.
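
  A minimal `@vectorize` sketch (the signature and function are illustrative): the body is written for scalars, yet the result behaves like a NumPy ufunc.

  ```python
  import numpy as np
  from numba import vectorize

  @vectorize(['float64(float64, float64)'])
  def rel_diff(a, b):
      # written as a scalar expression
      return abs(a - b) / (abs(a) + abs(b) + 1e-12)

  print(rel_diff(1.0, 1.1))                          # scalar inputs
  print(rel_diff(np.ones(5), np.linspace(0, 1, 5)))  # array inputs, broadcasting like a ufunc
  ```
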
- When writing Numba code you need to trade off Pythonic code against C-ish code. For example, `for index in indexes` will change to `for index in range(len(indexes))`.
- To use all threads, use `@njit(nogil=True)`. If you don't use it, you will not benefit from a `ThreadPoolExecutor`.
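
  A sketch of the `nogil=True` pattern (the chunking and worker count are illustrative): once the GIL is released, a `ThreadPoolExecutor` can run the compiled function on several chunks at the same time.

  ```python
  from concurrent.futures import ThreadPoolExecutor

  import numpy as np
  from numba import njit

  @njit(nogil=True)  # release the GIL while the compiled code runs
  def chunk_sum(arr):
      total = 0.0
      for i in range(arr.shape[0]):
          total += arr[i]
      return total

  data = np.random.rand(4_000_000)
  chunks = np.array_split(data, 4)
  with ThreadPoolExecutor(max_workers=4) as pool:
      partials = list(pool.map(chunk_sum, chunks))
  print(sum(partials))
  ```
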
- Using a thread pool executor is similar to using the `parallel=True` flag in the `@njit` decorator. Make sure the problem is embarrassingly parallel. If you go this route, use `prange` in place of `range` and install TBB.
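
  A sketch of the `parallel=True` / `prange` route (the loop body is illustrative; the iterations must be independent for this to be safe).

  ```python
  import numpy as np
  from numba import njit, prange

  @njit(parallel=True)
  def scaled_sum(arr):
      total = 0.0
      for i in prange(arr.shape[0]):  # prange lets Numba split the loop across threads
          total += arr[i] * 0.5       # simple reductions like this are handled automatically
      return total

  print(scaled_sum(np.random.rand(1_000_000)))
  ```
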
- `float32` is great. Use it wherever possible.
- If you don't care much about floating-point precision, `fastmath=True` is your friend.
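
  A sketch combining the two points above, `float32` inputs plus `fastmath=True` (the kernel is illustrative; results can differ slightly from strict IEEE evaluation order).

  ```python
  import numpy as np
  from numba import njit

  @njit(fastmath=True)  # allows reassociation/vectorization at the cost of strict IEEE semantics
  def rms(arr):
      acc = np.float32(0.0)
      for i in range(arr.shape[0]):
          acc += arr[i] * arr[i]
      return np.sqrt(acc / arr.shape[0])

  x = np.random.rand(1_000_000).astype(np.float32)  # float32 halves memory traffic vs float64
  print(rms(x))
  ```
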
- Remember Numba supports a limited set of functions. A good starting point is to check Numba's supported NumPy and Python features.
- If you really want efficient Python code, always use `@njit`. It will fail to execute if your code falls back to object mode. This is not the case with `@jit`, which will run your code in object mode and only throw a warning; that can be very slow and defeat the purpose if the objective is speed.
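
  A hedged sketch of the difference; exact behaviour depends on the Numba version (recent releases deprecate the automatic object-mode fallback), and the pandas call just stands in for anything unsupported in nopython mode.

  ```python
  import pandas as pd
  from numba import jit, njit

  def series_mean(s):
      return s.mean()  # pandas objects are not supported in nopython mode

  strict = njit(series_mean)  # raises a TypingError when called: fails loudly, which is what you want
  loose = jit(series_mean)    # older Numba: silently falls back to object mode with a warning (slow)

  # strict(pd.Series([1.0, 2.0, 3.0]))  # numba.core.errors.TypingError
  # loose(pd.Series([1.0, 2.0, 3.0]))   # runs, but with no speed-up
  ```
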
- To call back into Python inside `@njit` code, use the `nb.objmode` context manager. Numba is a JIT compiler; static typing is a must-have for Cython code, which makes Cython less flexible compared to Numba.
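
  A minimal `objmode` sketch (the timestamp is just an arbitrary piece of interpreter-only work); note that variables leaving the block must have their types declared.

  ```python
  import time

  import numpy as np
  import numba as nb

  @nb.njit
  def scaled(arr):
      with nb.objmode(stamp='float64'):  # temporarily drop back to the interpreter
          stamp = time.time()            # anything CPython can do is allowed in this block
          print('object-mode timestamp:', stamp)
      return arr * 2.0, stamp

  print(scaled(np.arange(3.0)))
  ```
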
- LLVM takes care of the backend and the different architectures, so targeting a GPU or a CPU is less stressful.
- To check all the dependencies, use `numba -s`.
- MKL, BLAS, SVML and TBB are great; explore them. If you have an Intel CPU, MKL + Intel Python is great for generating synthetic data.
- Talk about `ufuncs` & `gufuncs`, and also about `vectorize` & `guvectorize`.
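
  A `@guvectorize` sketch next to the `@vectorize` one above (the moving-average kernel is illustrative); the layout string `'(n),()->(n)'` is the part worth studying.

  ```python
  import numpy as np
  from numba import guvectorize, float64

  @guvectorize([(float64[:], float64, float64[:])], '(n),()->(n)')
  def moving_average(x, window, out):
      w = int(window)
      acc = 0.0
      for i in range(x.shape[0]):
          acc += x[i]
          if i >= w:
              acc -= x[i - w]
          out[i] = acc / min(i + 1, w)

  print(moving_average(np.arange(10.0), 3.0))                # 1-D input
  print(moving_average(np.arange(12.0).reshape(3, 4), 3.0))  # loops over rows like a gufunc
  ```
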
- If you have a GPU, use `target='cuda'`.
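
  A hedged sketch, assuming a CUDA-capable GPU and a working toolkit install; `@vectorize` accepts a `target` keyword.

  ```python
  import numpy as np
  from numba import vectorize

  @vectorize(['float32(float32, float32)'], target='cuda')  # requires a working CUDA setup
  def gpu_add(a, b):
      return a + b

  x = np.arange(1_000_000, dtype=np.float32)
  print(gpu_add(x, x)[:5])
  ```
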
- `numba.stencil` is great for convolution, sliding windows, or any other neighborhood computation. For C callbacks use `numba.cfunc`.
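
  A minimal `@stencil` sketch (a three-point moving average; the kernel is illustrative), wrapped in `@njit` so the whole thing compiles.

  ```python
  import numpy as np
  from numba import njit, stencil

  @stencil
  def smooth_kernel(a):
      # relative indexing: a[0] is the current element, a[-1] and a[1] its neighbours
      return (a[-1] + a[0] + a[1]) / 3.0

  @njit
  def smooth(arr):
      return smooth_kernel(arr)

  print(smooth(np.arange(10.0)))  # border elements default to 0
  ```
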
- Show how to use Numba with pandas.
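
  One common pattern, sketched here with made-up column names: pull the underlying NumPy arrays out of the DataFrame and hand those to the jitted function, since Numba cannot see pandas objects directly.

  ```python
  import numpy as np
  import pandas as pd
  from numba import njit

  @njit
  def weighted_mean(values, weights):
      num = 0.0
      den = 0.0
      for i in range(values.shape[0]):
          num += values[i] * weights[i]
          den += weights[i]
      return num / den

  df = pd.DataFrame({'price': np.random.rand(1000), 'volume': np.random.rand(1000)})
  print(weighted_mean(df['price'].to_numpy(), df['volume'].to_numpy()))
  ```
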
- NumPy ships a testing module (`numpy.testing`) with all sorts of numerical comparison helpers. Use it wherever possible.
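
  For example (a hedged sketch), comparing a jitted result against its pure-Python counterpart with a tolerance instead of exact equality.

  ```python
  import numpy as np
  from numba import njit

  @njit(fastmath=True)
  def total(arr):
      s = 0.0
      for i in range(arr.shape[0]):
          s += arr[i]
      return s

  x = np.random.rand(10_000)
  # fastmath may reorder the summation, so compare with a tolerance rather than ==
  np.testing.assert_allclose(total(x), total.py_func(x), rtol=1e-10)
  ```
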
- Measure, measure & measure. SnakeViz is a good profiler if you prefer web UIs. Prioritize what needs to be optimized.
- `deepcopy` does not work with Numba.
- Talk about the default profiler that comes with Numba. `from time import perf_counter`. We can use `foo.py_func()` to call the pure-Python version of a jitted function.
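
  A timing sketch using both ideas (names and sizes are illustrative): `perf_counter` for wall-clock timing and `.py_func` to run the original, uncompiled function for comparison.

  ```python
  from time import perf_counter

  import numpy as np
  from numba import njit

  @njit
  def dot(a, b):
      s = 0.0
      for i in range(a.shape[0]):
          s += a[i] * b[i]
      return s

  a = np.random.rand(2_000_000)
  b = np.random.rand(2_000_000)
  dot(a, b)  # warm-up call so compilation is not included in the timing

  t0 = perf_counter(); dot(a, b); t1 = perf_counter()
  t2 = perf_counter(); dot.py_func(a, b); t3 = perf_counter()
  print(f"compiled: {t1 - t0:.4f}s, pure python: {t3 - t2:.4f}s")
  ```
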
- Talk about single and multiple instructions (SIMD) in Numba.
- An example of how we are using this to calculate DTW distance using Numba in multiplier.
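
  A hedged sketch of a plain dynamic time warping distance under `@njit` (squared-difference cost, full window); not necessarily the exact variant referred to above.

  ```python
  import numpy as np
  from numba import njit

  @njit
  def dtw_distance(x, y):
      n, m = x.shape[0], y.shape[0]
      cost = np.full((n + 1, m + 1), np.inf)
      cost[0, 0] = 0.0
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              d = (x[i - 1] - y[j - 1]) ** 2
              cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
      return np.sqrt(cost[n, m])

  print(dtw_distance(np.sin(np.linspace(0, 6, 80)), np.sin(np.linspace(0.5, 6.5, 100))))
  ```
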
Reference: