gemm
There are 70 repositories under the gemm topic.
OpenNMT/CTranslate2
Fast inference engine for Transformer models
DefTruth/CUDA-Learn-Notes
Notes on Tensor/CUDA Cores, 150+ CUDA kernels, and a toy HGEMM library using WMMA, MMA, and CuTe (99%~100%+ of cuBLAS TFLOPS).
CNugteren/CLBlast
Tuned OpenCL BLAS
flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
salykova/matmul.c
High-Performance FP32 Matrix Multiplication on CPU
yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
mratsim/laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
ROCm/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
coderonion/awesome-cuda-and-hpc
🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, TensorRT and High Performance Computing (HPC) projects.
cp2k/dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
yui0/slibs
Single file libraries for C/C++
yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimization of DGEMM on CPU, eventually surpassing Intel MKL performance, even under multithreading.
ROCm/hipBLASLt
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library.
Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
enp1s0/ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
aredden/torch-cublas-hgemm
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU.
CoffeeBeforeArch/mmul
Serial and parallel implementations of matrix multiplication
hma02/cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm.
andylolu2/simpleGEMM
A simple yet fast implementation of matrix multiplication in CUDA.
BoooC/CNN-Accelerator-Based-on-Eyeriss-v2
A flexible and energy-efficient accelerator for sparse convolutional neural networks.
hma02/cublasgemm-benchmark
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm.
eth-cscs/spla
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
iVishalr/GEMM
Fast matrix multiplication implementation in C. The algorithm is similar to what NumPy uses to compute dot products.
szagoruyko/openai-gemm.pytorch
PyTorch bindings for openai-gemm
mz24cn/gemm_optimization
The repository targets OpenCL GEMM performance optimization. It compares several libraries (clBLAS, CLBlast, MIOpenGemm, Intel MKL on CPU, and cuBLAS on CUDA) across different matrix sizes, vendor hardware, and operating systems. Out-of-the-box x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS).
XiaoSong9905/dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance.
KarhouTam/cuda-kernels
Some common CUDA kernel implementations (Not the fastest).
Bruce-Lee-LY/cuda_back2back_hgemm
Uses tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
yui0/ugemm
GEMM
merledu/magma-si
Matrix accelerator generator for GEMM operations, based on the SIGMA architecture and written in Chisel HDL.
TensorBFS/CuTropicalGEMM.jl
The fastest Tropical number matrix multiplication on GPU
zixuanweeei/gemm-opt
Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.
foreverrookie/cuda-opt-samples
CUDA optimization samples including sgemm, reduce... To be continued.