- CPU
- GPU
  - PyTorch CUDA: Numeric parallelization similar to NumPy
  - Numba CUDA: Easy parallelization
  - cuDF: DataFrame parallelization similar to Pandas (by RAPIDS)
  - cuML: Machine learning parallelization similar to Scikit-learn (by RAPIDS)
  - cuGraph: Graph parallelization similar to NetworkX (by RAPIDS)
  - CuPy: GPU matrix library similar to NumPy
  - PyCuda
  - PyOpenCL
- Dask: Distributed parallelization
Due to Python's GIL (Global Interpreter Lock), only one thread can hold the interpreter lock at a time, so even with multiple threads the interpreter ultimately executes Python bytecode serially.
Threading is useful in:
- GUI programs: For example, in a text-editing program one thread can record the user's input, another can display the text, a third can do spell-checking, and so on.
- Network programs: For example, web scrapers. Multiple threads can scrape multiple webpages in parallel. The threads spend most of their time waiting for pages to download from the Internet, and that waiting is the biggest bottleneck, so threading is a perfect fit here (see the sketch below). Web servers work similarly.
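For the I/O-bound case, a minimal sketch using the standard library's concurrent.futures.ThreadPoolExecutor (the URLs here are placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = [
    "https://example.com",
    "https://www.python.org",
]  # placeholder URLs

def fetch(url):
    # Each thread spends most of its time waiting on the network,
    # during which the GIL is released and other threads can run.
    with urlopen(url) as response:
        return url, len(response.read())

with ThreadPoolExecutor(max_workers=8) as executor:
    for url, size in executor.map(fetch, urls):
        print(url, size, "bytes")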
import threading

def func(x):
    return x * x

thread1 = threading.Thread(target=func, args=(4,))  # args must be a tuple
thread2 = threading.Thread(target=func, args=(5,))
thread1.start()  # Starts the thread asynchronously
thread2.start()  # Starts the thread asynchronously
thread1.join()   # Wait for the thread to terminate
thread2.join()   # Wait for the thread to terminate
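For contrast, a rough sketch of why threads do not help CPU-bound pure-Python code: two threads counting down take about as long as (or longer than) one serial run, because only one thread can hold the GIL at a time. Exact timings will vary by machine.

import threading
import time

def count(n):
    # Pure-Python CPU-bound loop; holds the GIL while running
    while n > 0:
        n -= 1

N = 10_000_000

start = time.perf_counter()
count(N)
count(N)
print("serial:  ", time.perf_counter() - start)

start = time.perf_counter()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
print("threaded:", time.perf_counter() - start)
# The threaded version is not faster, because only one thread
# can execute Python bytecode at any moment.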
Multiprocessing outshines threading when the program is CPU-intensive and doesn't have to do any I/O or wait on user interaction, for example a program that just crunches numbers.
import multiprocessing

def func(x):
    return x * x

process1 = multiprocessing.Process(target=func, args=(4,))  # args must be a tuple
process2 = multiprocessing.Process(target=func, args=(5,))
process1.start()  # Start the process
process2.start()  # Start the process
process1.join()   # Wait for the process to terminate
process2.join()   # Wait for the process to terminate
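Note that multiprocessing.Process does not hand the target's return value back to the parent. A minimal sketch of one common pattern, passing results back through a multiprocessing.Queue:

import multiprocessing

def func(x, queue):
    queue.put(x * x)  # Send the result back to the parent process

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=func, args=(x, queue)) for x in (4, 5)]
    for p in processes:
        p.start()
    results = [queue.get() for _ in processes]  # Collect one result per process
    for p in processes:
        p.join()
    print(results)  # e.g. [16, 25] (order may vary)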
import multiprocessing

def f(x):
    return x * x

cores = 4
pool = multiprocessing.Pool(cores)
results = pool.map(f, [1, 2, 3])  # [1, 4, 9]
pool.close()
pool.join()
PyTorch's torch.multiprocessing is a wrapper around the native multiprocessing module. It supports the exact same operations, but extends it so that all tensors sent through a multiprocessing.Queue have their data moved into shared memory, and only a handle is sent to the other process.
import torch.multiprocessing as mp

if __name__ == '__main__':
    num_processes = 4
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=func, args=(x,))  # func and x defined elsewhere
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
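A minimal sketch of what the shared-memory behaviour means in practice: a tensor whose storage is in shared memory can be modified in place by a worker process, and the parent sees the change.

import torch
import torch.multiprocessing as mp

def worker(t):
    t += 1  # In-place update on the shared storage

if __name__ == '__main__':
    tensor = torch.zeros(3)
    tensor.share_memory_()          # Move the storage to shared memory
    p = mp.Process(target=worker, args=(tensor,))
    p.start()
    p.join()
    print(tensor)                   # tensor([1., 1., 1.]) - the parent sees the update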
Numba is a just-in-time (JIT) compiler for Python. It works well with loops and NumPy, but not with Pandas.
Numba also caches the compiled machine code after the first call, so subsequent calls are even faster because the function does not need to be compiled again.
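A rough sketch of this compile-once behaviour (array size and timings are illustrative):

import time
import numpy as np
from numba import njit

@njit
def total(arr):
    s = 0.0
    for x in arr:   # Plain Python loop, compiled to machine code by Numba
        s += x
    return s

arr = np.random.rand(10_000_000)

t0 = time.perf_counter(); total(arr); print("1st call (includes compilation):", time.perf_counter() - t0)
t0 = time.perf_counter(); total(arr); print("2nd call (cached machine code): ", time.perf_counter() - t0)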
- Object mode `@jit`: Only good for checking errors with Python
- Compile mode `@jit(nopython=True)` or `@njit`: Good machine code performance
- Multithreading `@jit(nopython=True, parallel=True)`: Good if your code is parallelizable
  - Automatic multithreading of array expressions and reductions
  - Explicit multithreading of loops with `prange()`: `for i in prange(10):`
  - External multithreading with tools like concurrent.futures or Dask
- Vectorization (SIMD) `@vectorize`:
  - `@vectorize(target='cpu')`: Single-threaded CPU
  - `@vectorize(target='parallel')`: Multi-core CPU
  - `@vectorize(target='cuda')`: CUDA GPU
from numba import jit

@jit
def function(x):
    # your loop or numerically intensive computations
    return x

@jit(nopython=True)
def function(a, b):
    # your loop or numerically intensive computations
    return result

@jit(nopython=True, parallel=True)
def function(a, b):
    # your loop or numerically intensive computations
    return result
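A hedged sketch of the prange() and @vectorize variants listed above (the array size, signature string, and target are illustrative choices):

import numpy as np
from numba import njit, prange, vectorize

@njit(parallel=True)
def parallel_sum(a):
    s = 0.0
    for i in prange(a.shape[0]):   # Iterations are distributed across threads
        s += a[i]
    return s

@vectorize(['float64(float64, float64)'], target='parallel')
def add(x, y):
    # Scalar kernel; Numba broadcasts it over whole arrays like a NumPy ufunc
    return x + y

a = np.random.rand(1_000_000)
print(parallel_sum(a))
print(add(a, a)[:5])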
import torch
print("GPU available:", torch.cuda.is_available())
print("GPU name: ", torch.cuda.get_device_name(0))
tensor = torch.FloatTensor([1., 2.]).cuda()  # Move the tensor to the GPU
tensor = tensor * 2                          # Operations now run on the GPU
result = tensor.cpu()                        # Move the result back to the CPU

torch.cuda.memory_allocated()  # Memory used by tensors
torch.cuda.memory_cached()     # Cached memory (visible in nvidia-smi); deprecated in newer PyTorch in favor of torch.cuda.memory_reserved()
torch.cuda.empty_cache()       # Free cached memory
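Newer PyTorch code usually selects a torch.device and moves data with .to(); a device-agnostic sketch of the same workflow:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tensor = torch.tensor([1., 2.], device=device)  # Create the tensor on the GPU (or fall back to CPU)
tensor = tensor * 2                             # Runs on the selected device
result = tensor.to('cpu')                       # Move the result back to the CPU
print(result)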
import cupy as cp
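Continuing from the import above, a minimal sketch of CuPy's NumPy-like workflow (assumes a CUDA-capable GPU with CuPy installed):

import numpy as np
import cupy as cp

x_cpu = np.random.rand(1000)
x_gpu = cp.asarray(x_cpu)     # Copy the array to the GPU
y_gpu = cp.sqrt(x_gpu) * 2    # Computation runs on the GPU
y_cpu = cp.asnumpy(y_gpu)     # Copy the result back to the host
print(y_cpu[:5])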