WCycleSVD

Primary language: CUDA. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

W-cycle SVD is a multilevel algorithm for batched SVD on GPUs. It is size-oblivious, exploiting data reuse and ensuring optimal convergence speed across multiple SVDs. To push performance further, we design efficient batched SVD and EVD kernels and propose a tailoring strategy that accelerates the batched GEMMs inside the SVDs. This repository contains the full source code of the W-cycle SVD program, plus supplementary material for reproducing the experiments reported in our paper.
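
For context, the minimal sketch below is not code from this repository; it shows how a batch of small matrices is typically factorized with cuSOLVER's Jacobi-based gesvdjBatched routine, which is the cuSOLVER baseline our experiments compare against. The matrix sizes, tolerance and sweep count are illustrative assumptions.

// Illustrative sketch (not from this repository): batched SVD of `batch`
// small m-by-n matrices via cuSOLVER's Jacobi-based gesvdjBatched routine.
// Error handling is omitted for brevity.
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main() {
    const int m = 32, n = 32, batch = 100;      // hypothetical sizes (<= 32 required)
    const int lda = m, ldu = m, ldv = n;

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    gesvdjInfo_t params;
    cusolverDnCreateGesvdjInfo(&params);
    cusolverDnXgesvdjSetTolerance(params, 1e-7);   // illustrative tolerance
    cusolverDnXgesvdjSetMaxSweeps(params, 15);     // illustrative sweep limit

    // Device storage: the matrices of the batch are packed back to back.
    float *dA, *dU, *dV, *dS;
    cudaMalloc(&dA, sizeof(float) * lda * n * batch);
    cudaMalloc(&dU, sizeof(float) * ldu * m * batch);
    cudaMalloc(&dV, sizeof(float) * ldv * n * batch);
    cudaMalloc(&dS, sizeof(float) * n * batch);
    // ... fill dA with the input matrices (e.g. cudaMemcpy from host) ...

    int lwork = 0;
    cusolverDnSgesvdjBatched_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                        m, n, dA, lda, dS, dU, ldu, dV, ldv,
                                        &lwork, params, batch);
    float *dWork;
    int *dInfo;
    cudaMalloc(&dWork, sizeof(float) * lwork);
    cudaMalloc(&dInfo, sizeof(int) * batch);

    // One call factors all `batch` matrices: A_i = U_i * diag(S_i) * V_i^T.
    cusolverDnSgesvdjBatched(handle, CUSOLVER_EIG_MODE_VECTOR,
                             m, n, dA, lda, dS, dU, ldu, dV, ldv,
                             dWork, lwork, dInfo, params, batch);
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dU); cudaFree(dV); cudaFree(dS);
    cudaFree(dWork); cudaFree(dInfo);
    cusolverDnDestroyGesvdjInfo(params);
    cusolverDnDestroy(handle);
    return 0;
}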

Abstract Checklist

Platforms:

NVIDIA CUDA platform

The GPUs we used are:

  • Tesla V100
  • Tesla P100
  • Tesla A100
  • GTX Titan X

AMD ROCm platform

The GPU we used is:

  • Vega20

System Details:

  • Ubuntu 18.04 x86_64 GNU/Linux (V100, P100 and Titan X)
  • CentOS 7.9 (A100)
  • CentOS 7.6 (AMD GPU)

Software Dependencies:

  • GNU Make 4.1
  • CUDA toolkit (tested 10.1, 11.6)
  • nvprof
  • gcc/g++ (tested 4.8.5, 7.5)
  • ROCm (tested 3.5, 4.2)
  • Intel oneMKL (tested 2022.1.0)
  • MAGMA (tested 2.5.4)

Environment Setup

Basic environment

CUDA Platform:

The CUDA toolkit (version 10.1 or later) should be installed. The compiler used is nvcc. The extra libraries needed are MAGMA, cuSOLVER and cuBLAS. The MAGMA build we used depends on the CUDA toolkit and Intel oneMKL.
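
If it is unclear whether the toolkit and the libraries are visible, the short check below (our own illustrative snippet, not part of the repository) prints the CUDA runtime/driver versions, the cuSOLVER and cuBLAS library versions, and the detected GPU. The build line in the comment assumes the file is saved as check_env.cu.

// Illustrative environment check (not part of this repository).
// Assumed build line: nvcc check_env.cu -o check_env -lcusolver -lcublas
#include <cstdio>
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <cublas_v2.h>

int main() {
    int runtime = 0, driver = 0;
    cudaRuntimeGetVersion(&runtime);
    cudaDriverGetVersion(&driver);
    printf("CUDA runtime %d, driver %d\n", runtime, driver);

    int maj = 0, min = 0, patch = 0;
    cusolverGetProperty(MAJOR_VERSION, &maj);
    cusolverGetProperty(MINOR_VERSION, &min);
    cusolverGetProperty(PATCH_LEVEL, &patch);
    printf("cuSOLVER %d.%d.%d\n", maj, min, patch);

    cublasGetProperty(MAJOR_VERSION, &maj);
    cublasGetProperty(MINOR_VERSION, &min);
    cublasGetProperty(PATCH_LEVEL, &patch);
    printf("cuBLAS %d.%d.%d\n", maj, min, patch);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)
        printf("GPU 0: %s (compute capability %d.%d)\n",
               prop.name, prop.major, prop.minor);
    return 0;
}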

ROCm Platform (single GPU):

The ROCm toolkit (version 4.2 or later) should be installed. The compiler used is hipcc. The extra library needed is MAGMA (HIP version). The MAGMA build we used depends on the ROCm toolkit and Intel oneMKL.

Compile the program

The project can be accessed on GitHub via this link.

Use git (http, ssh, etc.) to clone the repository into a local directory.

For the 4 environments on which our artifact was mainly tested, there are 4 branches:

  • main_CUDA
  • test_Tensor_Core,
  • test_HIP,
  • test_Cluster

After cloning the branch corresponding to your environment, run make in the root directory.

Prepare necessary data

For the main_CUDA branch, the data files are too large to store in the GitHub repository. Please generate them by running the following commands:

unzip data/UF_matrixset.zip
./test 99

Experiments list

This list covers all the experiments in the revised paper.

V100, P100 and GTX Titan X (main_CUDA branch):

Time of one-sided Jacobi methods in different cases. (Fig. 1)
Run: ./test 1
One-sided Jacobi method for a batched SVD of 100 matrices, each of size 1536×1536. (Fig. 2)
Run: ./test 2
Different tile sizes for two batched GEMMs at Level 1 of W-cycle SVD with two levels for 100 matrices. (TABLE I)
Run: ./test 3
W-cycle SVD improvement over cuSOLVER with matrix sizes below 32. (Fig. 7)
Run: ./test 4
Comparison with cuSOLVER using batch size = 1 and matrix sizes between 500 and 10000. (Fig. 8(a))
Run: ./test 5
W-cycle SVD performance improvement with matrix sizes between 64 and 1024. (Fig. 8(b))
Run: ./test 6
W-cycle SVD improvement over MAGMA. (Fig. 9)
Run: ./test 7
Time (s) for SVDs of 200 matrices on the P100 GPU. (TABLE IV)
Run: ./test 8
Evaluation of different approaches in W-cycle SVD: one warp or α warps. (Fig. 10(a))
Run: ./test 9
Evaluation of different approaches in W-cycle SVD: original or parallel EVD. (Fig. 10(b))
Run: ./test 10
GPU occupancy. (Fig. 11(a))
Run: ./test11.sh
GM transactions. (Fig. 11(b))
Run: ./test12.sh
Improvements of the tailoring strategy. (Fig. 12)
Run: ./test 13
Time (s) of W-cycle SVD with different tailoring plans. (TABLE V)
Run: ./test 14
Evaluation of W-cycle SVD with various matrix sizes, using the SuiteSparse matrix set. (TABLE VI)
Run: ./test 15
Sensitivity on different GPUs. (Fig. 14(a))
Run: ./test 17
Evaluation of accuracy and convergence speed. (TABLE VII)
Run: ./test 18
Evaluation of accuracy. (Fig. 15(a))
Run: ./test 19
Evaluation of convergence speed. (Fig. 15(b))
Run: ./test 20

A100 (test_Tensor_Core branch):

Evaluation on the A100 GPU with tensor cores. (Fig. 13)
Run: ./test 16
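
The sketch below is illustrative only and is not the tensor-core kernels from the test_Tensor_Core branch; it shows one common way to route a batch of FP32 GEMMs onto the A100's TF32 tensor cores through cuBLAS. The matrix size, batch size and GEMM configuration are assumptions made for the example.

// Illustrative sketch (not the repository's kernels): running a batch of
// single-precision GEMMs on A100 TF32 tensor cores via cuBLAS.
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 256, batch = 100;              // hypothetical sizes
    const long long stride = (long long)n * n;   // distance between matrices
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * stride * batch);
    cudaMalloc(&dB, sizeof(float) * stride * batch);
    cudaMalloc(&dC, sizeof(float) * stride * batch);
    // ... fill dA and dB with the batched operands ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    // CUBLAS_COMPUTE_32F_FAST_TF32 asks cuBLAS to use TF32 tensor-core math
    // while keeping FP32 inputs/outputs (requires CUDA 11+ and an Ampere GPU).
    cublasGemmStridedBatchedEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                               n, n, n,
                               &alpha,
                               dA, CUDA_R_32F, n, stride,
                               dB, CUDA_R_32F, n, stride,
                               &beta,
                               dC, CUDA_R_32F, n, stride,
                               batch,
                               CUBLAS_COMPUTE_32F_FAST_TF32,
                               CUBLAS_GEMM_DEFAULT);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}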

Vega20 (test_HIP branch):

Sensitivity on different GPUs. (Fig. 14(a))
Run: ./svd

GPU cluster (test_Cluster branch):

Data assimilation application. (Fig. 14(b))
Run: sbatch test18.slurm
The number of GPUs used is defined in the test18.slurm script. After the program finishes, the results are written to test18.o.