ZooSVD

A collection of high-performance computing routines wrapped in Python to perform the singular value decomposition (SVD) of dense matrices in NumPy format.

Getting started

The dependencies are Eigen, LAPACKE, and CUDA (for the GPU wrappers). On Ubuntu 22.04 LTS they can be installed with the following command (adapt it for your distribution):

apt install liblapacke-dev libeigen3-dev nvidia-cuda-toolkit

Compile the wrapper library for Python:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_CUDA=1
make
make install

Then add the path to the PyZooSVD folder to your Python path so that PyZooSVD can be imported. For example, add the following at the top of your Python script:

import sys
sys.path.append("path_to_ZooSVD_folder")  # replace the placeholder with your actual path
import PyZooSVD
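
A slightly more defensive variant of the snippet above is sketched below. It relies only on the Python standard library; the helper name `add_zoosvd_to_path` is hypothetical and not part of ZooSVD, and the path is still a placeholder.

```python
import importlib.util
import sys
from pathlib import Path


def add_zoosvd_to_path(folder):
    """Append `folder` to sys.path once and report whether PyZooSVD can be found.

    Hypothetical helper for illustration; it is not shipped with ZooSVD.
    """
    resolved = str(Path(folder).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    # find_spec returns None when the module cannot be located on sys.path.
    return importlib.util.find_spec("PyZooSVD") is not None


if __name__ == "__main__":
    # Placeholder path: replace it with the actual location on your machine.
    if add_zoosvd_to_path("path_to_ZooSVD_folder"):
        import PyZooSVD
    else:
        print("PyZooSVD not found; check the path and that the build completed.")
```

Checking with `find_spec` before importing gives a clearer message than a bare `ImportError` when the path is wrong.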

Documentation to come...

Available wrappers

The table below lists the available wrappers, the algorithm each one uses, and the data types it supports (s for float32, d for float64, c for complex64, and z for complex128).

| Wrapper | Algorithm | Data type | Comments |
|---|---|---|---|
| LAPACK driver="gesvd" | Householder reflections, bidiagonalization | s, d, c, z | Equivalent to NumPy (slightly slower) |
| LAPACK driver="gesdd" | Divide and conquer | s, d, c, z | Equivalent to NumPy (slightly slower) |
| Eigen driver="jacobi" | Two-sided Jacobi (with QR preconditioners) | s, d | Very slow |
| Eigen driver="bidiagdc" | Divide and conquer | s, d | Slow (to be reviewed) |
| SCALAPACK | | | To come |
| CUDA driver="Jacobi" | Jacobi | s, d, c, z | GPU algorithm |
| CUDA driver="Polar-Decomposition" | Polar decomposition | s, d, c, z | GPU algorithm |
| CUDA driver="QR" | Householder reflections, bidiagonalization | s, d, c, z | GPU equivalent of LAPACK gesvd |

More wrappers to come...
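
The four type letters in the table correspond to NumPy dtypes. Since the PyZooSVD API is not documented here, the sketch below uses NumPy's own `np.linalg.svd` as a stand-in to show what an input of each type looks like and how a result can be checked. The helper names `random_matrix` and `relative_residual` are illustrative assumptions.

```python
import numpy as np

# LAPACK-style type letters from the table, mapped to NumPy dtypes.
DTYPES = {"s": np.float32, "d": np.float64, "c": np.complex64, "z": np.complex128}


def random_matrix(n, letter, seed=0):
    """Return a random n x n matrix of the dtype named by `letter`."""
    rng = np.random.default_rng(seed)
    dtype = np.dtype(DTYPES[letter])
    a = rng.standard_normal((n, n))
    if dtype.kind == "c":
        # Add an imaginary part for the complex types.
        a = a + 1j * rng.standard_normal((n, n))
    return a.astype(dtype)


def relative_residual(a):
    """Compute the SVD with NumPy and return ||A - U diag(s) V^H|| / ||A||."""
    u, s, vh = np.linalg.svd(a, full_matrices=False)
    # u * s scales column j of u by s[j], which equals u @ diag(s).
    return np.linalg.norm(a - (u * s) @ vh) / np.linalg.norm(a)


for letter in DTYPES:
    a = random_matrix(64, letter)
    print(letter, a.dtype, relative_residual(a))
```

The residual should be close to machine precision for the chosen type, so single-precision inputs (s, c) give a noticeably larger value than double-precision ones (d, z).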

Performance

CPU performance (Laptop)

Time in seconds on an i7-1165G7 @ 2.8 GHz (4 cores, 4 threads) for the SVD of a square matrix of the given size in float64 format:

Matrix size 512 1024 2048 4096 8192 10240 12288
NumPy 0.0651 0.358 2.49 19.7 141.2 291.9 504.6
LAPACK driver="gesvd" 20.52
LAPACK driver="gesdd" 0.068 0.400 2.59 20.2
Eigen driver="jacobi" > 300
Eigen driver="bidiagdc" 6.71
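
The exact benchmarking procedure behind these numbers is not stated. A minimal sketch of how the NumPy row could be measured is given below, assuming a random Gaussian matrix, a full SVD with singular vectors, and the best of a few repeats; the function name `time_svd` is illustrative.

```python
import time

import numpy as np


def time_svd(n, repeats=3, seed=0):
    """Return the best wall-clock time in seconds for np.linalg.svd on an n x n float64 matrix."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))  # float64 by default
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.linalg.svd(a)
        best = min(best, time.perf_counter() - start)
    return best


# Small sizes keep the run short; larger sizes from the table take much longer.
for n in (128, 256, 512):
    print(f"n={n}: {time_svd(n):.4f} s")
```

Timings depend on the BLAS/LAPACK build that NumPy links against and on the number of threads it uses, so absolute values will differ from machine to machine.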

GPU performance

Time in seconds on NVIDIA A40 (48 GB GDDR6) and NVIDIA A100 (40 GB HBM2) GPUs for the SVD of a square matrix of the given size in float64 format (NumPy on a 4-core CPU as a reference):

| Matrix size | 512 | 1024 | 2048 | 4096 | 8192 | 10240 | 12288 |
|---|---|---|---|---|---|---|---|
| NumPy | 0.0651 | 0.358 | 2.49 | 19.7 | 141.2 | 291.9 | 504.6 |
| CUDA - QR (A40) | 0.922 | 1.25 | 3.10 | 14.2 | 84.8 | 152.6 | 262.6 |
| CUDA - Polar-D (A40) | 0.87 | 1.09 | 2.19 | 8.86 | 78.2 | 106.0 | 180.6 |
| CUDA - Jacobi (A40) | 0.799 | 0.889 | 1.26 | 4.24 | 28.5 | 54.9 | 107.7 |
| CUDA - QR (A100) | 1.12 | 1.29 | 2.33 | 7.12 | 35.57 | 61.48 | 94.87 |
| CUDA - Jacobi (A100) | 1.05 | 1.11 | 1.54 | 4.67 | 25.3 | 48.5 | - |
| CUDA - Polar-D (A100) | 1.01 | 1.07 | 1.17 | 1.82 | 5.48 | 8.98 | 13.6 |

For matrix sizes of 2048 and above, the GPU implementations are faster than NumPy on the CPU (i7-1165G7 on 4 cores), with the single exception of the QR driver on the A40 at size 2048. The fastest implementation is the Polar-Decomposition algorithm on the NVIDIA A100 GPU, which according to the table above is roughly 26 to 37 times faster than NumPy for matrix sizes of 8192 and above.
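
The speedup of a GPU driver over NumPy can be obtained by dividing the two timings for each matrix size. A minimal sketch using the NumPy and CUDA - Polar-D (A100) rows copied from the GPU table:

```python
# Timings in seconds, copied from the GPU performance table.
sizes = [512, 1024, 2048, 4096, 8192, 10240, 12288]
numpy_cpu = [0.0651, 0.358, 2.49, 19.7, 141.2, 291.9, 504.6]
polar_a100 = [1.01, 1.07, 1.17, 1.82, 5.48, 8.98, 13.6]

# Speedup = CPU time / GPU time for the same matrix size.
speedups = {n: cpu / gpu for n, cpu, gpu in zip(sizes, numpy_cpu, polar_a100)}

for n, s in speedups.items():
    print(f"{n:>6}: {s:5.1f}x")
```

A ratio above 1 means the GPU run was faster; a ratio below 1 means NumPy on the CPU was faster for that size.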