Numba is a Python library that offers Just-in-Time (JIT) compilation and allows you to write GPU kernels in Python. This repo demonstrates a few examples of using Numba:
- `example_vector_sum_and_average.ipynb` shows how to add two vectors and how to compute the average of the elements in a vector. This example uses 1-dimensional blocks and threads.
- `example_image_convolution.ipynb` shows how to run convolution on an image. This example uses 2-dimensional blocks and threads.
- `example_mppi_numba_obstacle_avoidance.ipynb` shows how to parallelize rollouts based on Model Predictive Path Integral (MPPI) control proposed by Williams et al. This implementation can run about 100x faster than the CPU implementation in `example_mppi_cpu.ipynb`. (The CPU implementation doesn't account for obstacles.)
Try these notebooks via Google Colab! Make sure to choose a GPU instance: "Runtime" -> "Change runtime type" -> "Hardware accelerator" -> "GPU".
```
pip3 install numba scikit-image
```
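After installing, you can check whether Numba detects a CUDA-capable GPU:

```python
from numba import cuda

cuda.detect()  # prints a summary of the CUDA devices Numba found
```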
- Host: the CPU
- Device: the GPU
- Host memory: the system's main memory
- Device memory: GPU memory
- Kernel: a GPU function launched by the host and executed on the device
- Device function: a GPU function that can only be invoked by kernels or other device functions
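To make these terms concrete, below is a minimal sketch (the function and variable names are illustrative, not from the notebooks): a kernel is launched from the host, reads from and writes to device memory, and calls a device function.

```python
import numpy as np
from numba import cuda

# Device function: callable only from a kernel or another device function.
@cuda.jit(device=True)
def scale_and_shift(x):
    return 2.0 * x + 1.0

# Kernel: launched by the host, executed on the device by many threads.
@cuda.jit
def transform(arr, out):
    i = cuda.grid(1)      # this thread's absolute index in the grid
    if i < arr.shape[0]:  # guard: the grid may have more threads than elements
        out[i] = scale_and_shift(arr[i])

x = np.arange(1000, dtype=np.float64)
d_x = cuda.to_device(x)               # host memory -> device memory
d_out = cuda.device_array_like(d_x)   # allocate device memory for the result

threads_per_block = 128
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block
transform[blocks_per_grid, threads_per_block](d_x, d_out)  # kernel launch

out = d_out.copy_to_host()            # device memory -> host memory
```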
While the CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible and can execute a few tens of these threads in parallel, the GPU is designed to excel at executing thousands of them in parallel (amortizing the slower single-thread performance to achieve greater throughput).
Blocks are organized into a 1D, 2D, or 3D grid of thread blocks, as illustrated below. The number of thread blocks in a grid is usually dictated by the size of the data being processed, which typically exceeds the number of processors in the system.
- Grid (1D/2D/3D): a grid consists of blocks. On modern GPUs, a grid can have up to 2^31 - 1 blocks in the x-dimension and 65,535 blocks in each of the y- and z-dimensions.
- Block (1D/2D/3D): a block consists of threads. Modern GPUs allow up to 1,024 threads per block.
- Thread: each thread executes one instance of the kernel function (see the sketch below).
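For instance, a kernel operating on a 2D array can use a 2D grid of 2D blocks (the kernel name and sizes below are illustrative):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one_2d(img):
    # cuda.grid(2) gives this thread's absolute (x, y) position in the grid.
    x, y = cuda.grid(2)
    if x < img.shape[0] and y < img.shape[1]:  # guard: the grid may overhang the array
        img[x, y] += 1.0

img = np.zeros((500, 700), dtype=np.float32)
threads_per_block = (16, 16)  # 256 threads per block
blocks_per_grid = (
    (img.shape[0] + threads_per_block[0] - 1) // threads_per_block[0],
    (img.shape[1] + threads_per_block[1] - 1) // threads_per_block[1],
)
# Passing a NumPy array directly makes Numba copy it to the device and back.
add_one_2d[blocks_per_grid, threads_per_block](img)
assert img[0, 0] == 1.0
```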
Why are they organized this way?
- Some tasks are naturally viewed in 2D/3D (e.g., image processing, ray-tracing).
- Threads in the same block have access to shared memory, which lets them cooperate, e.g., when computing a convolution.
- Global memory (visible to the whole grid) is slower than shared memory (visible to a block), which is in turn slower than a thread's private registers.
Note that threads are executed asynchronously, so synchronization may be required among the threads in a block and sometimes among the blocks in a grid (see the sketch below).
Grid of Thread Blocks. (Image from the NVIDIA CUDA guide)
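As a sketch of both ideas, the block-level reduction below stages data in shared memory and uses `cuda.syncthreads()` as a barrier among the threads of a block (the kernel name and sizes are illustrative):

```python
import numpy as np
from numba import cuda, float32

TPB = 128  # threads per block (compile-time constant for the shared array)

@cuda.jit
def block_sum(arr, partial_sums):
    # Shared memory: visible to all threads in the same block.
    tmp = cuda.shared.array(TPB, dtype=float32)
    i = cuda.grid(1)
    t = cuda.threadIdx.x
    tmp[t] = arr[i] if i < arr.shape[0] else 0.0
    cuda.syncthreads()      # wait until every thread has written its element

    # Tree reduction within the block.
    s = TPB // 2
    while s > 0:
        if t < s:
            tmp[t] += tmp[t + s]
        cuda.syncthreads()  # synchronize between reduction steps
        s //= 2

    if t == 0:
        partial_sums[cuda.blockIdx.x] = tmp[0]

x = np.ones(10_000, dtype=np.float32)
blocks_per_grid = (x.size + TPB - 1) // TPB
partial = np.zeros(blocks_per_grid, dtype=np.float32)
block_sum[blocks_per_grid, TPB](x, partial)
print(partial.sum())  # ~10000.0
```

Synchronization across blocks has no in-kernel barrier here; the usual pattern, as in this sketch, is to end the kernel (a kernel launch boundary synchronizes the grid) and combine the per-block partial sums afterwards.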
- Numba for CUDA GPUs
- The official CUDA C programming guide
- A pretty good playlist on YouTube: https://www.youtube.com/watch?v=4APkMJdiudU&list=PLC6u37oFvF40BAm7gwVP7uDdzmW83yHPe
Contact Xiaoyi (Jeremy) Cai (xyc@mit.edu).