cudaMatMul

Practicing CUDA through matrix multiplication.
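
As a starting point for this kind of practice, a minimal sketch of a naive matrix multiplication kernel is given here. It is not taken from this repository: the names `matmul_naive` and `matmul_gpu`, the row-major layout, and the 16 x 16 block size are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Naive kernel (illustrative): each thread computes one element of C = A * B.
// Assumed layout: A is M x K, B is K x N, C is M x N, all row-major.
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {  // guard threads that fall outside the matrix
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, and copies the result back. Returns the first CUDA error seen.
cudaError_t matmul_gpu(const float* hA, const float* hB, float* hC,
                       int M, int N, int K) {
    size_t bytesA = (size_t)M * K * sizeof(float);
    size_t bytesB = (size_t)K * N * sizeof(float);
    size_t bytesC = (size_t)M * N * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytesA);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytesB);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytesC);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytesA, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytesB, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);  // 256 threads per block (assumed, not tuned)
        dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
        matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
        err = cudaGetLastError();  // reports launch configuration errors
    }
    // The blocking copy below also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytesC, cudaMemcpyDeviceToHost);

    cudaFree(dA);  // cudaFree on a null pointer is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Calling `matmul_gpu(A, B, C, M, N, K)` with host arrays fills `C` and returns `cudaSuccess` on success. A tiled version that stages sub-blocks of `A` and `B` in shared memory is the usual next step, but it is left out to keep the sketch short.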

Author: Sunqianqi

Device	NVIDIA A100-PCIE-40GB
CUDA Driver Version / Runtime Version	11.4 / 11.2
CUDA Capability Major/Minor version number	8.0
Total amount of global memory	39.59 GBytes (42505273344 bytes)
GPU Clock rate	1410 MHz (1.41 GHz)
Memory Clock rate	1215 MHz
Memory Bus Width	5120-bit
L2 Cache Size	41943040 bytes
Max Texture Dimension Size (x,y,z)	1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers	1D=(32768) x 2048, 2D=(32768,32768) x 2048
Total amount of constant memory	65536 bytes
Total amount of shared memory per block	49152 bytes
Total number of registers available per block	65536
Warp size	32
Maximum number of threads per multiprocessor	2048
Maximum number of threads per block	1024
Maximum sizes of each dimension of a block	1024 x 1024 x 64
Maximum sizes of each dimension of a grid	2147483647 x 65535 x 65535
Maximum memory pitch	2147483647 bytes
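
The properties above resemble the output of the CUDA `deviceQuery` sample. Assuming that is how they were collected, a subset can be read through the runtime API roughly as sketched here. The function names `bytes_to_gib` and `print_device_summary` are hypothetical, and the clock rate and bus width fields are deliberately left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Convert a byte count to GiB; the "GBytes" figure above uses 2^30 bytes.
double bytes_to_gib(size_t bytes) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}

// Hypothetical helper: prints a subset of the properties listed above
// for the given device index. Returns the first CUDA error encountered.
cudaError_t print_device_summary(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    int driver = 0, runtime = 0;
    err = cudaDriverGetVersion(&driver);
    if (err != cudaSuccess) return err;
    err = cudaRuntimeGetVersion(&runtime);
    if (err != cudaSuccess) return err;

    printf("Device\t%s\n", prop.name);
    // Versions are encoded as 1000 * major + 10 * minor.
    printf("CUDA Driver Version / Runtime Version\t%d.%d / %d.%d\n",
           driver / 1000, (driver % 100) / 10,
           runtime / 1000, (runtime % 100) / 10);
    printf("CUDA Capability Major/Minor version number\t%d.%d\n",
           prop.major, prop.minor);
    printf("Total amount of global memory\t%.2f GBytes (%zu bytes)\n",
           bytes_to_gib(prop.totalGlobalMem), prop.totalGlobalMem);
    printf("L2 Cache Size\t%d bytes\n", prop.l2CacheSize);
    printf("Total amount of shared memory per block\t%zu bytes\n",
           prop.sharedMemPerBlock);
    printf("Total number of registers available per block\t%d\n",
           prop.regsPerBlock);
    printf("Warp size\t%d\n", prop.warpSize);
    printf("Maximum number of threads per multiprocessor\t%d\n",
           prop.maxThreadsPerMultiProcessor);
    printf("Maximum number of threads per block\t%d\n",
           prop.maxThreadsPerBlock);
    printf("Maximum sizes of each dimension of a block\t%d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Maximum sizes of each dimension of a grid\t%d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Maximum memory pitch\t%zu bytes\n", prop.memPitch);
    return cudaSuccess;
}
```

Calling `print_device_summary(0)` reports the first visible GPU. Limits such as the 1024 threads per block and 49152 bytes of shared memory per block are the ones that constrain block and tile sizes when writing a matrix multiplication kernel.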
