gemm
There are 70 repositories under the gemm topic.
OpenNMT/CTranslate2
Fast inference engine for Transformer models
DefTruth/CUDA-Learn-Notes
Notes on Tensor/CUDA Cores, 150+ CUDA kernels, and a toy HGEMM library using WMMA, MMA, and CuTe (99%~100%+ of cuBLAS TFLOPS).
CNugteren/CLBlast
Tuned OpenCL BLAS
flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
salykova/matmul.c
High-Performance FP32 Matrix Multiplication on CPU
yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
mratsim/laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
ROCm/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
coderonion/awesome-cuda-and-hpc
🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, TensorRT and High Performance Computing (HPC) projects.
cp2k/dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
yui0/slibs
Single file libraries for C/C++
yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimization of DGEMM on CPU, eventually surpassing Intel MKL performance, even under multithreading.
ROCm/hipBLASLt
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library.
Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
enp1s0/ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
aredden/torch-cublas-hgemm
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU.
CoffeeBeforeArch/mmul
Serial and parallel implementations of matrix multiplication
hma02/cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm.
andylolu2/simpleGEMM
A simple yet fast implementation of matrix multiplication in CUDA.
BoooC/CNN-Accelerator-Based-on-Eyeriss-v2
A flexible and energy-efficient accelerator for sparse convolutional neural networks.
hma02/cublasgemm-benchmark
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm.
eth-cscs/spla
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
iVishalr/GEMM
Fast matrix multiplication implementation in C. The algorithm is similar to what NumPy uses to compute dot products.
szagoruyko/openai-gemm.pytorch
PyTorch bindings for openai-gemm
mz24cn/gemm_optimization
The repository targets OpenCL GEMM performance optimization. It compares several libraries (clBLAS, CLBlast, MIOpenGemm, Intel MKL on CPU, and cuBLAS on CUDA) across different matrix sizes, vendor hardware, and operating systems. Out-of-the-box x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS).
XiaoSong9905/dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance.
KarhouTam/cuda-kernels
Some common CUDA kernel implementations (Not the fastest).
Bruce-Lee-LY/cuda_back2back_hgemm
Uses tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
yui0/ugemm
GEMM
merledu/magma-si
Matrix accelerator generator for GEMM operations, based on the SIGMA architecture and written in Chisel HDL.
TensorBFS/CuTropicalGEMM.jl
The fastest Tropical number matrix multiplication on GPU
zixuanweeei/gemm-opt
Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.
foreverrookie/cuda-opt-samples
CUDA optimization samples including sgemm, reduce... To be continued.