gemm
There are 89 repositories under the gemm topic.
OpenNMT/CTranslate2
Fast inference engine for Transformer models
CNugteren/CLBlast
Tuned OpenCL BLAS
flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
salykova/sgemm.c
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
coderonion/awesome-cuda-and-hpc
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
mratsim/laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
ROCm/Tensile
[DEPRECATED] Moved to ROCm/rocm-libraries repo
yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.
cp2k/dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
yui0/slibs
Single file libraries for C/C++
ROCm/hipBLASLt
[DEPRECATED] Moved to ROCm/rocm-libraries repo
BoooC/CNN-Accelerator-Based-on-Eyeriss-v2
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
enp1s0/ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
aredden/torch-cublas-hgemm
PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu
Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
ROCm/iris
AMD RAD's experimental RMA library for Triton.
CoffeeBeforeArch/mmul
Serial and parallel implementations of matrix multiplication
andylolu2/simpleGEMM
A simple but fast implementation of matrix multiplication in CUDA.
iVishalr/GEMM
Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.
hma02/cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm
hma02/cublasgemm-benchmark
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm
eth-cscs/spla
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
KarhouTam/cuda-kernels
Some common CUDA kernel implementations (Not the fastest).
szagoruyko/openai-gemm.pytorch
PyTorch bindings for openai-gemm
Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
soran-ghaderi/cuRBLAS
🍒 cuRBLAS (Randomized BLAS) is a GPU-accelerated library for accelerating AI and HPC applications.
XiaoSong9905/dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance
mz24cn/gemm_optimization
This repository targets performance optimization of the OpenCL gemm function. It compares the sgemm performance of several libraries (clBLAS, CLBlast, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across different matrix sizes, vendor hardware, and operating systems. Prebuilt x86_64 binaries for MSVC, MinGW, and Linux (CentOS) are provided, so it works out of the box.
enp1s0/cuMpSGEMM
Fast SGEMM emulation on Tensor Cores
merledu/magma-si
Matrix Accelerator Generator for GeMM Operations based on SIGMA Architecture in CHISEL HDL
ROCm/tritonBLAS
A lightweight Triton-based General Matrix Multiplication (GEMM) library.
Bruce-Lee-LY/cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.