Pinned Repositories
cuda_back2back_hgemm
Computes back-to-back HGEMM (half-precision general matrix multiplication) on Tensor Cores using MMA PTX instructions.
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) on Tensor Cores, using the WMMA API and MMA PTX instructions.
cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) on CUDA cores.
cuda_hook
Hooks CUDA-related dynamic libraries using automated code generation tools.
cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
decoding_attention
Decoding Attention is optimized for multi-head attention (MHA) in the decoding stage of LLM inference, using CUDA cores.
flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
matrix_multiply
Several common matrix multiplication methods implemented on CPU and NVIDIA GPU using C++11 and CUDA.
memory_pool
A simple and efficient memory pool implemented in C++11.
thread_pool
A thread pool implemented in C++11 to process a task queue.
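As a rough illustration of the idea behind the thread_pool entry (worker threads draining a shared task queue), here is a minimal C++11 sketch. It is not taken from the repository: the class name `SimpleThreadPool` and the method `Submit` are hypothetical, and the real project may be structured quite differently.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal thread pool; an assumption for illustration,
// not the implementation used in the thread_pool repository.
class SimpleThreadPool {
public:
    explicit SimpleThreadPool(std::size_t num_threads) {
        // Start the workers; each one runs WorkerLoop until shutdown.
        for (std::size_t i = 0; i < num_threads; ++i) {
            workers_.emplace_back([this] { WorkerLoop(); });
        }
    }

    ~SimpleThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();  // wake every worker so it can observe stop_
        for (std::thread &worker : workers_) {
            worker.join();
        }
    }

    SimpleThreadPool(const SimpleThreadPool &) = delete;
    SimpleThreadPool &operator=(const SimpleThreadPool &) = delete;

    // Push a callable onto the task queue and return a future for its result.
    template <typename F>
    auto Submit(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        // packaged_task is move-only, so it is held in a shared_ptr to make
        // the queued std::function copyable.
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.emplace([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                // Exit only once shutdown was requested and the queue is empty.
                if (stop_ && tasks_.empty()) {
                    return;
                }
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();  // run the task outside the lock
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
};
```

In this sketch, a caller would construct `SimpleThreadPool pool(4);`, call `pool.Submit(...)` with a lambda, and read the result from the returned future with `get()`. The destructor finishes any queued tasks before joining the workers.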
Bruce-Lee-LY's Repositories
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) on Tensor Cores, using the WMMA API and MMA PTX instructions.
Bruce-Lee-LY/cuda_hook
Hooks CUDA-related dynamic libraries using automated code generation tools.
Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) on CUDA cores.
Bruce-Lee-LY/flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Bruce-Lee-LY/decoding_attention
Decoding Attention is optimized for multi-head attention (MHA) in the decoding stage of LLM inference, using CUDA cores.
Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
Bruce-Lee-LY/matrix_multiply
Several common matrix multiplication methods implemented on CPU and NVIDIA GPU using C++11 and CUDA.
Bruce-Lee-LY/cuda_back2back_hgemm
Computes back-to-back HGEMM (half-precision general matrix multiplication) on Tensor Cores using MMA PTX instructions.
Bruce-Lee-LY/memory_pool
A simple and efficient memory pool implemented in C++11.
Bruce-Lee-LY/thread_pool
A thread pool implemented in C++11 to process a task queue.
Bruce-Lee-LY/deep_learning
Training and inference for several common deep learning models, implemented with TensorFlow and PyTorch.
Bruce-Lee-LY/algorithm_design
Solutions to several common problems using a range of algorithm design methods, implemented in C++11.
Bruce-Lee-LY/crawler
Several fun crawler cases implemented in Python.
Bruce-Lee-LY/data_structure
Several commonly used data structures implemented in C++11.
Bruce-Lee-LY/machine_learning
Several common machine learning algorithms implemented with scikit-learn.