sgemm

There are 11 repositories under the sgemm topic.

Liu-xiandong/How_to_optimize_in_GPU
This is a series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm. The performance of these kernels is at or near the theoretical limit.
Language: Cuda 803 13 15126
wangzyon/NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
Language: Cuda 206 2 433
salykova/matmul.c
Fast multi-threaded matrix multiplication in C
Language: C 163 5 07
mz24cn/gemm_optimization
This repository targets performance optimization of the OpenCL gemm function. It compares the sgemm performance of several BLAS libraries (clBLAS, clBLAST, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across different matrix sizes, vendors' hardware, and operating systems. Prebuilt x86_64 binaries for MSVC, MinGW, and Linux (CentOS) are provided, so it works out of the box.
Language: C 14 3 05
Stefan20162016/maxas-explained
Scott Gray's maxas assembler sgemm, explaining the (for me) missing parts: https://github.com/NervanaSystems/maxas
Language: CSS 13 1 03
yui0/ugemm
GEMM
Language: C 10 2 23
c3sr/scope
A benchmark framework for POWER and x86_64
Language: Mathematica 7 6 481
fsword73/SGEMM_on_VEGA
An alternative SGEMM implementation on AMD Vega Series
Language: Assembly 7 2 14
JunLee85/ARM32-SGEMM-LIB
A fast sgemm library with fix 16 enabled on ARM32
Language: C 3 0 03
XiaoSong9905/cuda-v100-kernels
CUDA Kernels on V100
Language: Cuda 3 1 01
aidevnn/CuPyFirstExample
A first CuPy example that computes GEMM with cuBLAS, with a handwritten CUDA kernel, and with NumPy BLAS
Language: Cuda 2 01
