SMDV is a collection of modules for testing the execution speed of
matrix-vector multiplication for different matrix formats.
Multiplications are performed on the GPU using NVIDIA CUDA.
The multiplication algorithms ERTILP and SERTILP were originally created for KMLib - SVM with CUDA.
Implemented multiplication formats (in CUDA C):
- ELLPACK
- SLICED ELLPACK
- ERTILP
- SERTILP
- CSR
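To illustrate the first of these layouts, here is a minimal NumPy sketch of the ELLPACK format (an illustration only, not SMDV's matrixformat.py code): every row is padded to the length of the longest row, giving one dense array of values and one of column indices.

```python
import numpy as np
import scipy.sparse as sps

def to_ellpack(matrix):
    """Pad every row to the length of the longest row (ELLPACK layout)."""
    csr = sps.csr_matrix(matrix)
    max_nnz = int(np.diff(csr.indptr).max())  # width of the widest row
    rows = csr.shape[0]
    data = np.zeros((rows, max_nnz), dtype=csr.dtype)   # padded values
    cols = np.zeros((rows, max_nnz), dtype=np.int32)    # padded column indices
    for i in range(rows):
        start, end = csr.indptr[i], csr.indptr[i + 1]
        n = end - start
        data[i, :n] = csr.data[start:end]
        cols[i, :n] = csr.indices[start:end]
    return data, cols

data, cols = to_ellpack([[1, 0, 2], [0, 3, 0], [4, 5, 6]])
```

The padding makes every row the same length, which is what lets one GPU thread per row read its data with a regular access pattern.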
SMDV also allows multiplication with the NumPy module. Such calculations are performed only on the CPU; you can use them as a reference point for the measurements performed on the GPU.
- Krzysztof Sopyła "ksirg" krzysztofsopyla@gmail.com
- Paweł Drozda "pdrozda"
- Sławomir Figiel "fivitti"
If you use this software in an academic publication, please cite: (TODO)
SMDV is licensed under the MIT License.
The main purpose of the project is to measure the execution time
of multiplications on the GPU for different matrix formats.
You can repeat our research on your own computer (you only need a
CUDA-compatible NVIDIA GPU).
You can also carry out your own experiments, with your own data
and independently chosen parameters.
SMDV makes it easy to cooperate with external scripts.
SMDV also includes tools that help take measurements for different data:
- Convert a matrix to the supported formats
  - You can use this to see how matrices are processed before the multiplication is performed.
  - Handled by the module matrixformat.py.
  - Command-line interface in the module cli-conv.py.
- Return the multiplication result
  - In addition to the computation time, the multiplication methods also return the result of the calculation. You can use it to check accuracy, or as an end in itself: it is a quick way to obtain the result of a matrix-by-vector multiplication.
  - Handled by the module matrixmultiplication.py.
  - Command-line interface in the module cli-multiply.py.
- Get the CUDA kernels for multiplication
  - You can view the source files to see exactly how the multiplication is performed. You can also use the kernels in your own programs. For further optimization, the codes are written in CUDA C using metaprogramming.
  - The CUDA C files are located in the kernels directory.
  - Metaprogramming and code compilation are handled by the module cudacodes.py.
- Get info about matrix and vector files
  - You can quickly get information about many matrices stored in the .mtx format and NumPy vectors stored in .npy, and export this information to a CSV file.
  - Command-line interface in the module cli-info.py.
- Simple and fast generation of matrices and vectors
  - You can generate random matrices and vectors of selected sizes and densities for testing.
  - Handled by the module matrixutilites.py.
  - Command-line interface in the module cli-gen.py.
- And more...
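The metaprogramming mentioned above can be pictured as parameter substitution into the CUDA C source before it is handed to the compiler. The template below is a made-up fragment for illustration only; the real templates live in cudacodes.py:

```python
# Hypothetical CUDA C fragment; real SMDV kernels are in the kernels directory.
KERNEL_TEMPLATE = """
#define BLOCK_SIZE %(block_size)d
#define THREADS_PER_ROW %(threads_per_row)d
__global__ void spmv(/* ... */) { /* ... */ }
"""

def render_kernel(block_size=128, threads_per_row=2):
    """Bake tuning parameters into the source as compile-time constants."""
    return KERNEL_TEMPLATE % {'block_size': block_size,
                              'threads_per_row': threads_per_row}

source = render_kernel(block_size=128, threads_per_row=4)
```

Baking the parameters in as constants lets the CUDA compiler unroll loops and size shared memory at compile time, which is the point of generating the source instead of passing the values at run time.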
SMDV uses the following packages:
- PyCUDA
- Numpy
- SciPy
- Click (for CLI)
- and standard library, of course
The following are example results of tests using the 10 largest matrices from the matrix collection of Francisco Vazquez.
The calculations were performed on a computer with:
- CPU: Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
- GPU Card: GeForce GTX 780 Ti
- CUDA Capability Major: 3.5
- CUDA Driver version: 6.0
Software versions:
- OS: Ubuntu 13.10
- python 2.7.5+
- PyCuda 2013.1.1
- Numpy 1.7.1
Arguments:
- Block size: 128
- Threads per row: 2
- Slice size: 64
- Prefetch: 2
- Align: 32
- Repeats: 10
- Compensation: 10 - this many "warm-up" multiplications were performed first; their results were not included.
Test vectors were generated at random in the range -1 to 1, with a density of 15%. You can download them from the vectors folder.
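A vector with these properties can be produced with NumPy alone (a sketch under the stated parameters; not necessarily how the vectors in the vectors folder were generated):

```python
import numpy as np

def generate_test_vector(length, density=0.15, seed=0):
    """Random vector with `density` nonzeros drawn uniformly from [-1, 1]."""
    rng = np.random.RandomState(seed)
    vector = np.zeros(length)
    nnz = int(length * density)
    # place the nonzeros at distinct random positions
    positions = rng.choice(length, size=nnz, replace=False)
    vector[positions] = rng.uniform(-1.0, 1.0, size=nnz)
    return vector

v = generate_test_vector(62451)  # length matching the cant matrix
```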
Information about the matrices:
Matrix | Rows | Columns | NNZ | Sparsity |
---|---|---|---|---|
cant | 62451 | 62451 | 4007383 | 0.103% |
consph | 83334 | 83334 | 6010480 | 0.087% |
cop20k_A | 121192 | 121192 | 2624331 | 0.018% |
mac_econ_fwd500 | 206500 | 206500 | 1273389 | 0.003% |
mc2depi | 525825 | 525825 | 2100225 | 0.001% |
pwtk | 217918 | 217918 | 11634424 | 0.024% |
qcd5_4 | 49152 | 49152 | 1916928 | 0.079% |
rma10 | 46835 | 46835 | 2374001 | 0.108% |
shipsec1 | 140874 | 140874 | 7813404 | 0.039% |
wbp256 | 65537 | 65537 | 31413934 | 0.731% |
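The Sparsity column above is the percentage of nonzero entries, 100 · NNZ / (Rows · Columns). For example, for cant:

```python
# Reproduce the Sparsity value for the cant matrix from the table above.
rows, cols, nnz = 62451, 62451, 4007383
density = 100.0 * nnz / (rows * cols)  # percentage of nonzero entries
# density is approximately 0.103
```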
Results:
Matrix | Format | Avg time [ms] | Std [ms] |
---|---|---|---|
cant | CPU | 7.658 | 0.021 |
 | CSR | 0.444 | 0.002 |
 | ELLPACK | 0.348 | 0.001 |
 | SLICED | 0.290 | 0.003 |
 | SERTILP | 0.275 | 0.002 |
 | ERTILP | 0.320 | 0.001 |
consph | CPU | 11.493 | 0.014 |
 | CSR | 0.521 | 0.001 |
 | ELLPACK | 0.419 | 0.002 |
 | SLICED | 0.373 | 0.001 |
 | SERTILP | 0.356 | 0.001 |
 | ERTILP | 0.423 | 0.001 |
cop20k_A | CPU | 6.627 | 0.010 |
 | CSR | 0.710 | 0.002 |
 | ELLPACK | 0.545 | 0.001 |
 | SLICED | 0.537 | 0.002 |
 | SERTILP | 0.523 | 0.012 |
 | ERTILP | 0.528 | 0.004 |
mac_econ_fwd500 | CPU | 2.734 | 0.008 |
 | CSR | 1.025 | 0.025 |
 | ELLPACK | 0.696 | 0.001 |
 | SLICED | 0.674 | 0.002 |
 | SERTILP | 0.629 | 0.002 |
 | ERTILP | 0.648 | 0.002 |
mc2depi | CPU | 4.556 | 0.160 |
 | CSR | 1.868 | 0.003 |
 | ELLPACK | 0.818 | 0.002 |
 | SLICED | 0.844 | 0.001 |
 | SERTILP | 0.836 | 0.003 |
 | ERTILP | 0.827 | 0.001 |
pwtk | CPU | 22.317 | 0.009 |
 | CSR | 1.268 | 0.001 |
 | ELLPACK | 1.073 | 0.003 |
 | SLICED | 0.904 | 0.002 |
 | SERTILP | 0.872 | 0.001 |
 | ERTILP | 1.006 | 0.002 |
qcd5_4 | CPU | 3.224 | 0.020 |
 | CSR | 0.295 | 0.001 |
 | ELLPACK | 0.190 | 0.001 |
 | SLICED | 0.190 | 0.002 |
 | SERTILP | 0.185 | 0.000 |
 | ERTILP | 0.183 | 0.001 |
rma10 | CPU | 3.925 | 0.007 |
 | CSR | 0.330 | 0.001 |
 | ELLPACK | 0.283 | 0.001 |
 | SLICED | 0.245 | 0.004 |
 | SERTILP | 0.226 | 0.001 |
 | ERTILP | 0.258 | 0.002 |
shipsec1 | CPU | 15.051 | 0.047 |
 | CSR | 0.932 | 0.003 |
 | ELLPACK | 0.787 | 0.005 |
 | SLICED | 0.712 | 0.001 |
 | SERTILP | 0.681 | 0.000 |
 | ERTILP | 0.773 | 0.003 |
wbp256 | CPU | 81.732 | 0.026 |
 | CSR | 1.873 | 0.005 |
 | ELLPACK | 1.608 | 0.002 |
 | SLICED | 1.153 | 0.003 |
 | SERTILP | 1.085 | 0.002 |
 | ERTILP | 1.515 | 0.007 |
How do you put SMDV into action and test the speed of matrix multiplication yourself?
There are two ways: you can write a script in Python, or use the provided CLI.
We assume that you have test matrices and vectors whose length equals the number of columns of the matrix. If not, you can visit MatrixMarket. We also recommend the matrix collection of Francisco Vazquez. You can also generate data with the provided tools described in the next section, or use the NumPy or SciPy packages.
We will use the module matrixmultiplication.py. It is the core of the whole process. It includes the modules cudacodes.py and matrixformat.py; you do not have to worry about them. You only need to use the core module.
A complete multiplication script is in examples/multiplication.py. It works according to the following scheme:
- Read the data
- Set the parameters
- Multiply the matrix
- Collect the results
Calling multiplication in this format is very simple. Like all of the methods, it takes the matrix, the vector, and the number of repetitions of the operation as arguments. In addition, it takes one special parameter: the block size.
```python
m = matrixmultiplication.multiply_ellpack(matrix,
                                          vector,
                                          block_size=128,
                                          repeat=10)
print 'List of execution times: {0}'.format(m[1])
```
This method has three special parameters: threads per row, slice size, and align (a constant used in the calculation that sets the alignment of the matrix).
```python
m = matrixmultiplication.multiply_sliced(matrix,
                                         vector,
                                         align=64,
                                         slice_size=32,
                                         threads_per_row=2,
                                         repeat=10)
print 'List of execution times: {0}'.format(m[1])
```
This method has the special parameters threads per row, slice size, align, and prefetch (the number of data-access requests issued in advance).
```python
m = matrixmultiplication.multiply_sertilp(matrix,
                                          vector,
                                          threads_per_row=2,
                                          slice_size=32,
                                          prefetch=2,
                                          align=64,
                                          repeat=10)
print 'List of execution times: {0}'.format(m[1])
```
This method has the special parameters threads per row, block size, and prefetch (the number of data-access requests issued in advance).
```python
m = matrixmultiplication.multiply_ertilp(matrix,
                                         vector,
                                         threads_per_row=2,
                                         prefetch=2,
                                         block_size=128,
                                         repeat=10)
print 'List of execution times: {0}'.format(m[1])
```
This method has only one special parameter: the block size.
```python
m = matrixmultiplication.multiply_csr(matrix,
                                      vector,
                                      block_size=128,
                                      repeat=10)
print 'List of execution times: {0}'.format(m[1])
```
This method uses the function dot from NumPy. It is included for comparison and works only on the CPU.
```python
m = matrixmultiplication.multiply_cpu(matrix,
                                      vector,
                                      repeat=10)
print 'List of execution times: {0}'.format(m[1])
```
The CLI for multiplication is in the module cli-multiply.py. A description of all parameters is available with the --help flag.
The program takes two paths as required arguments: the first indicates the vector, the second the matrix.
If the vector path is a directory, it will be searched for .npy files. The program will then try to match a vector to the given matrix: the last (alphabetically) vector whose length equals the number of columns of the matrix. If no match is found, a message will be displayed.
If the matrix path is a directory, the program will search it for .mtx files and then try to perform the calculations for each matrix in turn.
Specifying both paths as directories is a simple way to automate testing on a collection.
You can easily specify all the parameters.
The formats to multiply with are chosen using the flags:
-ell (ELLPACK), -sle (SLICED), -ert (ERTILP), -see (SERTILP),
-csr (CSR) and -cpu (multiplication on the CPU).
The CLI allows you to display the result of the multiplication, all execution times, the average execution time, and the standard deviation of the mean.
You can also test the correctness of the returned results within a given tolerance. To do this, use the flag --test. The reference result is computed with the standard dot function from the NumPy package. If the errors do not fall within the specified range, corresponding messages will be displayed.
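The kind of check performed by --test can be sketched with NumPy (an approximation of the idea, not SMDV's exact implementation):

```python
import numpy as np

def check_result(result, matrix, vector, tolerance=1e-6):
    """Compare a GPU result against NumPy's dense dot product."""
    expected = np.asarray(matrix).dot(vector)
    return np.allclose(result, expected, atol=tolerance)

matrix = [[1.0, 0.0], [0.0, 2.0]]
vector = [3.0, 4.0]
assert check_result([3.0, 8.0], matrix, vector)
```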
You can also save the measurement results to a CSV file with the -o flag.
Multiply a matrix by a vector in the ELLPACK, ERTILP, and CPU formats with your own parameters, and print the average time and standard deviation:
```
user@host:~/SMDV$ python cli-multiply.py -b 128 -tpr 4 -p 2 -ell
-ert -cpu -avr -std Data/vectors/Vector_62451.npy Data/real/cant.mtx
Multiply matrix Data/real/cant.mtx by the vector Data/vectors/Vector_62451.npy
Multiplication with Numpy (only CPU)
Average time [ms]: 7.84460783005
Standard deviation: 0.0
Multiplication with ELLPACK
Average time [ms]: 0.6616320014
Standard deviation: 0.0
Multiplication with ERTILP
Average time [ms]: 0.466080009937
Standard deviation: 0.0
```
Automatic vector search:
```
user@host:$ python cli-multiply.py Data/vectors/ Data/real/cant.mtx
Multiply matrix Data/real/cant.mtx by the vector Data/vectors/Vector_62451.npy
```
Automatic matrix and vector search:
```
user@host:~/projekty/SMDV$ python cli-multiply.py Data/vectors/ Data/real/
Multiply matrix Data/real/mac_econ_fwd500.mtx by the vector Data/vectors/Vector_206500.npy
Multiply matrix Data/real/cop20k_A.mtx by the vector Data/vectors/Vector_121192.npy
Multiply matrix Data/real/wbp256.mtx by the vector Data/vectors/Vector_65537.npy
```
In your script, use the module matrixutilites.py. Use the method generate_sparse_matrix
to generate a sparse matrix (it is inefficient for very dense matrices).
Use the method generate_vector to generate a dense vector (it is inefficient for very sparse vectors).
You can save a generated matrix using the method save_matrix_to_file. The file format is Matrix Market (.mtx).
For vectors, use save_vector_to_numpy_file. The file format is NumPy binary (.npy).
These methods are "overlays" on the functions scipy.io.mmwrite and numpy.save that help you create a name for the file.
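If you prefer, the underlying functions can also be called directly. The sketch below uses only SciPy and NumPy; the file names are hypothetical, not the naming scheme produced by the helpers:

```python
import os
import tempfile
import numpy as np
import scipy.io
import scipy.sparse as sps

# Generate a small random sparse matrix and a dense vector.
matrix = sps.random(10, 10, density=0.2, format='csr')
vector = np.random.uniform(-1, 1, size=10)

out_dir = tempfile.mkdtemp()
matrix_path = os.path.join(out_dir, 'Matrix_10x10.mtx')  # hypothetical name
vector_path = os.path.join(out_dir, 'Vector_10.npy')     # hypothetical name

scipy.io.mmwrite(matrix_path, matrix)  # Matrix Market file (.mtx)
np.save(vector_path, vector)           # NumPy binary file (.npy)

loaded = scipy.io.mmread(matrix_path)  # round-trip check
```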
You can use the CLI cli-gen to generate data.
Use the command matrix for matrix generation. The parameter -s INT INT
specifies the matrix dimensions (width, height). You can use it repeatedly. Beware of exhausting
memory when generating many large matrices. Using the other parameters you can specify the minimum,
maximum, percentage of zeros, precision, and type of the generated numbers. Use --help for more info.
Use the command vector for vector generation. The parameter -l INT specifies the length of the vector. You can use it repeatedly. For info about the other parameters, use --help.
Use the command save to save all generated data. You can define the directory and filename.
Use the command echo to display the generated matrices and vectors. Use caution with large matrices.
The commands are executed sequentially from left to right, so their order is important! The generating commands (vector and/or matrix) should come first, followed by the (optional) save and echo commands.
```
user@host:$ python cli-gen.py vector -l 5 -i matrix -s 3 3 -min 1 -max 5 echo
Vector (len: 5):
(0) 0
(1) 4
(2) -9
(3) 5
(4) 0
Matrix (rows: 3, cols: 3):
(0, 0) 5.0
(0, 1) 1.0
(1, 1) 1.0
(1, 2) 4.0
(2, 1) 5.0
```
cli-info returns basic information about matrices and vectors stored in files, such as size, number of non-zero values, and density. You can save this information to a CSV file using the parameter -o FILEPATH. You can also view the contents of a small matrix.
The CLI takes paths to files and directories as arguments. If you specify a path to a directory, all matrices and vectors within it will be checked.
```
user@host:~/SMDV$ python cli-info.py Data/Generated/
Process: Data/Generated/Matrix_67x67.mtx
Info about matrix:
rows 67
cols 67
nnz 4
sparsing [%] 0.0891
Process: Data/Generated/Vector_5.npy
Info about vector:
length 5
nnz 2
sparsing [%] 40.0
```
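The figures above can be reproduced with SciPy (a sketch of the computation, not the code of cli-info.py):

```python
import scipy.sparse as sps

def matrix_info(matrix):
    """Return (rows, cols, nnz, density%) for a matrix, as cli-info reports."""
    m = sps.coo_matrix(matrix)
    rows, cols = m.shape
    nnz = m.nnz
    density = 100.0 * nnz / (rows * cols)  # the "sparsing [%]" figure
    return rows, cols, nnz, density

rows, cols, nnz, density = matrix_info([[0, 1], [0, 2]])
```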
You can use the formats we use in your own projects. The CUDA code is located in the kernels directory.
The kernels are prepared for metaprogramming.
They are built and compiled by cudacodes.py. The methods get_cuda_FORMAT (where FORMAT is the name of a kernel)
return a tuple with the multiplication function and the texture. Some of them take arguments that optimize the code for a given situation.
If you enter incorrect values, the code will not work correctly.
Your job is just to bind the texture to the correct address in GPU memory and call the function with the appropriate parameters.
For examples, see matrixmultiplication.py, where we use them.
- "Efficient Sparse Matrix-Vector Multiplication on CUDA", Nathan Bell, Michael Garland [11 December 2008]
- "The sparse matrix vector product on GPUs", F. Vazquez, E. M. Garzon, J. A. Martınez, J. J. Fernandez [14 June 2009]
- "Improving the performance of the sparse matrix vector product with GPUs", F. Vazquez, G. Ortega, J.J. Fernandez, E.M. Garzon [2010]
- "Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures", Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan
- "A memory efficient and fast sparse matrix vector product on a gpu", A. Dziekonski, A. Lamecki, and M. Mrozowski