W-cycle SVD is a multilevel algorithm for batched SVD on GPUs. W-cycle SVD is size-oblivious: it exploits data reuse and preserves the optimal convergence speed across multiple SVDs. To push performance further, we design efficient batched SVD and EVD kernels and propose a tailoring strategy to accelerate the batched GEMMs inside the SVDs. This repository includes the full source code of the W-cycle SVD program, plus supporting material for reproducing the experiments reported in our paper.
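For background, the base algorithm is the one-sided Jacobi SVD: Givens rotations orthogonalize pairs of columns until all columns are mutually orthogonal, at which point the column norms are the singular values. The following is a minimal NumPy sketch of plain one-sided Jacobi for illustration only; it is not this repository's GPU implementation, and the function name, tolerance, and sweep limit are our own choices:

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD: rotate column pairs of A until all columns
    are mutually orthogonal; the singular values are then the column
    norms.  Returns (U, s, Vt) with A == U @ diag(s) @ Vt (full-rank A)."""
    A = np.array(A, dtype=np.float64)
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                a = A[:, i] @ A[:, i]      # ||A_i||^2
                b = A[:, j] @ A[:, j]      # ||A_j||^2
                c = A[:, i] @ A[:, j]      # A_i . A_j
                if abs(c) <= tol * np.sqrt(a * b):
                    continue               # pair already orthogonal
                converged = False
                # Rotation that zeroes the off-diagonal entry c of the
                # 2x2 Gram matrix [[a, c], [c, b]]
                zeta = (b - a) / (2.0 * c)
                t = np.copysign(1.0, zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                cs = 1.0 / np.sqrt(1.0 + t * t)
                sn = cs * t
                # Apply the rotation to columns i, j of A and of V
                Ai, Aj = A[:, i].copy(), A[:, j].copy()
                A[:, i], A[:, j] = cs * Ai - sn * Aj, sn * Ai + cs * Aj
                Vi, Vj = V[:, i].copy(), V[:, j].copy()
                V[:, i], V[:, j] = cs * Vi - sn * Vj, sn * Vi + cs * Vj
        if converged:
            break
    s = np.linalg.norm(A, axis=0)
    order = np.argsort(-s)                 # descending singular values
    s = s[order]
    U = A[:, order] / s                    # normalize rotated columns
    return U, s, V[:, order].T
```

Each sweep visits every column pair; it is this pairwise rotation structure that the batched GPU kernels parallelize across many matrices at once.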
Tested GPUs:
- Tesla V100
- Tesla P100
- GTX Titan X
- Vega20

Operating systems:
- Ubuntu 18.04 x86_64 GNU/Linux (V100, P100, and Titan X)
- CentOS 7.9 (A100)
- CentOS 7.6 (AMD GPU)

Software:
- GNU Make 4.1
- CUDA toolkit (tested 10.1, 11.6)
- nvprof
- gcc/g++ (tested 4.8.5, 7.5)
- ROCm (tested 3.5, 4.2)
- Intel oneMKL (tested 2022.1.0)
- MAGMA (tested 2.5.4)
A CUDA toolkit (version 10.1 or later) must be installed; the compiler used is nvcc. Required libraries are MAGMA, cuSOLVER, and cuBLAS. The MAGMA build we used depends on the CUDA toolkit and Intel oneMKL.
A ROCm toolkit (version 4.2 or later) must be installed; the compiler used is hipcc. The required library is MAGMA (HIP build), which in our setup depends on the ROCm toolkit and Intel oneMKL.
The project can be accessed on GitHub via this link. Use git (HTTPS, SSH, etc.) to clone the repository into a local directory.
For the four environments on which our artifact was mainly tested, there are four branches:
- main_CUDA
- test_Tensor_Core
- test_HIP
- test_Cluster
After cloning the branch corresponding to your environment, run make in the root directory.
For the main_CUDA branch, the test data are too large to store in the GitHub repository. Generate them by running the following commands:

unzip data/UF_matrixset.zip
./test 99
The list below covers all the experiments in the revised paper.
Time of one-sided Jacobi methods in different cases. (Fig. 1)
Run: ./test 1
One-sided Jacobi method for a batched SVD of 100 matrices, each of size 1536×1536. (Fig. 2)
Run: ./test 2
Different tile sizes for two batched GEMMs at Level 1 of a two-level W-cycle SVD for 100 matrices. (TABLE I)
Run: ./test 3
Improvement of W-cycle SVD over cuSOLVER with matrix sizes below 32. (Fig. 7)
Run: ./test 4
Comparison with cuSOLVER at batch size 1 with matrix sizes between 500 and 10000. (Fig. 8(a))
Run: ./test 5
Performance improvement of W-cycle SVD with matrix sizes between 64 and 1024. (Fig. 8(b))
Run: ./test 6
Improvement of W-cycle SVD over MAGMA. (Fig. 9)
Run: ./test 7
Time (s) for SVDs of 200 matrices on a P100 GPU. (TABLE IV)
Run: ./test 8
Evaluation of different approaches in W-cycle SVD: one warp or multiple warps. (Fig. 10(a))
Run: ./test 9
Evaluation of different approaches in W-cycle SVD: original or parallel EVD. (Fig. 10(b))
Run: ./test 10
GPU occupancy. (Fig. 11(a))
Run: ./test11.sh
GM (global memory) transactions. (Fig. 11(b))
Run: ./test12.sh
Improvements from the tailoring strategy. (Fig. 12)
Run: ./test 13
Time (s) of W-cycle SVD with different tailoring plans. (TABLE V)
Run: ./test 14
Evaluation of W-cycle SVD with various matrix sizes, using the SuiteSparse matrix set. (TABLE VI)
Run: ./test 15
Sensitivity on different GPUs. (Fig. 14(a))
Run: ./test 17
Evaluation of accuracy and convergence speed. (TABLE VII)
Run: ./test 18
Evaluation of accuracy. (Fig. 15(a))
Run: ./test 19
Evaluation of convergence speed. (Fig. 15(b))
Run: ./test 20
Evaluation on an A100 GPU with tensor cores. (Fig. 13)
Run: ./test 16
Sensitivity on different GPUs. (Fig. 14(a))
Run: ./svd
Data assimilation application. (Fig. 14(b))
Run: sbatch test18.slurm
The number of GPUs used is defined in the test18.slurm script. After the program finishes, the results are written to test18.o.