Run the commands below to get started:
$ ssh <YourNetID>@adroit.princeton.edu
$ cd /scratch/network/$USER
$ git clone https://github.com/jdh4/python-gpu.git
$ cd python-gpu
NumPy (for CPUs)
import numpy as np
X = np.random.randn(10000, 10000)
Y = np.matmul(X, X)
CuPy (for GPUs)
import cupy as cp
X = cp.random.randn(10000, 10000)
Y = cp.matmul(X, X)
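CuPy arrays live in GPU memory, so a result has to be copied back before it can be used by NumPy code on the host. A minimal sketch (not part of the workshop scripts):

import cupy as cp

X = cp.random.randn(10000, 10000)   # allocated on the GPU
Y = cp.matmul(X, X)                 # computed on the GPU
Y_host = cp.asnumpy(Y)              # copy the result back to a NumPy array on the host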
Let's compare the performance of the two. Inspect the two Python scripts:
$ cd python-gpu/cupy
$ cat matmul_numpy.py
$ cat matmul_cupy.py
As expected, the difference between the two scripts is minimal. Run the jobs and compare the timings:
$ sbatch cupy.slurm
$ sbatch numpy.slurm
In the above case we are comparing the CuPy code running on 1 CPU-core and 1 A100 GPU against the NumPy code running on 8 CPU-cores and no GPU. Which of the two libraries is faster for this problem?
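Note that CuPy launches GPU kernels asynchronously, so a fair measurement must synchronize with the device before stopping the clock. A sketch of such a timing (an illustration, not the exact contents of matmul_cupy.py):

import time
import cupy as cp

X = cp.random.randn(10000, 10000)
cp.cuda.Device().synchronize()    # wait for setup work to finish
start = time.perf_counter()
Y = cp.matmul(X, X)
cp.cuda.Device().synchronize()    # wait for the kernel to complete
print(f"elapsed: {time.perf_counter() - start:.3f} s")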
The code below calculates the singular value decomposition of a matrix using NumPy. Convert the code to CuPy and run it.
import numpy as np
X = np.random.randn(5000, 5000)
u, s, v = np.linalg.svd(X)
print(f"s.sum() = {s.sum()}")
Hint: You only need to change at most 6 characters.
Take a look at the CuPy reference manual. Can you use CuPy in your research?
JAX is Autograd and XLA, brought together for high-performance machine learning research. JAX can be used for:
- automatic differentiation of Python and NumPy functions (more general than TensorFlow); see the short sketch after this list
- building non-conventional neural network architectures and loss functions
- accelerating code with just-in-time (JIT) compilation
- carrying out computations on multiple GPUs/TPUs
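A tiny sketch of automatic differentiation combined with JIT compilation (the function f is just an illustration):

import jax.numpy as jnp
from jax import grad, jit

f = lambda x: jnp.sum(x**2)    # simple scalar-valued function
df = jit(grad(f))              # gradient of f, compiled with XLA
print(df(jnp.arange(3.0)))     # [0. 2. 4.] since the gradient is 2x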
Take a look at an example from the JAX GitHub page:
$ cd python-gpu/jax
$ cat example.py
import jax.numpy as jnp
from jax import grad, jit, vmap
def predict(params, inputs):
    for W, b in params:
        outputs = jnp.dot(inputs, W) + b
        inputs = jnp.tanh(outputs)  # inputs to the next layer
    return outputs                  # no activation on last layer

def loss(params, inputs, targets):
    preds = predict(params, inputs)
    return jnp.sum((preds - targets)**2)

grad_loss = jit(grad(loss))  # compiled gradient evaluation function
perex_grads = jit(vmap(grad_loss, in_axes=(None, 0, 0)))  # fast per-example grads
Run the code with:
$ sbatch submit.sh
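For reference, here is a hedged sketch of how the compiled functions above could be called; the layer sizes and batch shape are illustrative assumptions and are not taken from example.py:

import numpy.random as npr

layer_sizes = [784, 512, 10]                # assumed layer widths
params = [(npr.randn(m, n), npr.randn(n))   # one (W, b) pair per layer
          for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
inputs  = npr.randn(128, 784)               # a batch of 128 examples
targets = npr.randn(128, 10)

batch_grads = grad_loss(params, inputs, targets)    # gradients for the whole batch
per_example = perex_grads(params, inputs, targets)  # one gradient per example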
Take a look at all of the JAX examples. You can run any of them by replacing the contents of example.py with the example you want to run.
See our JAX knowledge base page for installation directions.
Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.
On compiling Python to machine code:
When Numba is translating Python to machine code, it uses the LLVM library to do most of the optimization and final code generation. This automatically enables a wide range of optimizations that you don't even have to think about.
Take a look at a sample CPU code:
import numpy as np

def myfunc(a):
    trace = 0.0
    # assuming square input matrix
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace
Using Numba we can speed the code up:
import numba
import numpy as np

@numba.jit(nopython=True)
def myfunc(a):  # function is compiled to machine code when called the first time
    trace = 0.0
    # assuming square input matrix
    for i in range(a.shape[0]):    # numba likes loops
        trace += np.tanh(a[i, i])  # numba likes NumPy functions
    return a + trace
Run the two codes to check the performance difference:
$ cd python-gpu/numba
$ cat cpu_without_numba.py
$ sbatch cpu_without_numba.slurm
$ cat cpu_with_numba.py
$ sbatch cpu_with_numba.slurm
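When comparing the timings, keep in mind that the first call to a jitted function includes the compilation time. A sketch of separating the warm-up call from the measured call (the array size is an assumption):

import time
import numpy as np

a = np.random.randn(2000, 2000)
myfunc(a)                          # warm-up call triggers compilation
start = time.perf_counter()
myfunc(a)                          # this call runs the compiled machine code
print(f"elapsed: {time.perf_counter() - start:.4f} s")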
Let's look at a GPU example. According to the Numba for GPUs webpage:
Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model. Kernels written in Numba appear to have direct access to NumPy arrays. NumPy arrays are transferred between the CPU and the GPU automatically.
View the Python script:
$ cat example_gpu.py
import numpy as np
from numba import cuda
@cuda.jit
def my_kernel(io_array):
    # thread id in a 1D block
    tx = cuda.threadIdx.x
    # block id in a 1D grid
    ty = cuda.blockIdx.x
    # block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # compute flattened index inside the array
    pos = tx + ty * bw
    if pos < io_array.size:  # check array boundaries
        io_array[pos] *= 2

if __name__ == "__main__":
    # create the data array - usually initialized some other way
    data = np.ones(100000)
    # set the number of threads in a block
    threadsperblock = 256
    # calculate the number of thread blocks in the grid
    blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock
    # call the kernel
    my_kernel[blockspergrid, threadsperblock](data)
    # print the result
    print(data[:3])
    print(data[-3:])
Run the job:
$ sbatch numba_gpu.slurm
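Since the kernel above relies on automatic transfers, the array is copied to the GPU and back on every launch. If you call a kernel repeatedly, you may want to manage the transfers yourself; a sketch using Numba's device arrays (variable names are illustrative, continuing from the script above):

d_data = cuda.to_device(data)                       # copy host -> device once
my_kernel[blockspergrid, threadsperblock](d_data)   # kernel operates on the device array
result = d_data.copy_to_host()                      # copy device -> host once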
Can you use Numba in your work to accelerate Python code on a CPU or GPU?
For more, see High-Performance Python for GPUs by Henry Schreiner.
One can install Numba for GPUs with:
$ conda create --name numba-env numba cudatoolkit
Be aware of the RAPIDS GPU libraries by NVIDIA for data analytics and machine learning with conventional models. RAPIDS provides GPU-accelerated counterparts to pandas, Scikit-learn and NetworkX (and more).
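A minimal sketch of the pandas-like cuDF API (assumes a RAPIDS installation; the column names are made up):

import cudf

df = cudf.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
print(df.groupby("key").mean())   # runs on the GPU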
See the Intro to GPU Computing repo for PyTorch and TensorFlow examples. For installation directions and more, see our PyTorch and TensorFlow knowledge base pages.
See the Intro to GPU Computing repo for GPU examples in other languages.
See the 05_cuda_libraries directory of the Intro to GPU Computing repo.