Pinned Repositories
neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
Awesome-GPU
Awesome resources for GPUs
code-samples
Source code examples from the Parallel Forall Blog
Cpp_houjie
Slides and code from Hou Jie's (侯捷) C++ courses
CPP_Optimizations_Diary
Tips and tricks to optimize your C++ code
cutlass
CUDA Templates for Linear Algebra Subroutines
flash_attention_inference
A condensed version of FlashAttention adapted for flash decoding
GPU_Microbenchmark
PLCT-Open-Reports
Slides and reports from the PLCT Lab on RISC-V and MLIR
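The neural-compressor entry above mentions low-bit quantization (INT8/FP8/INT4/FP4/NF4). As a generic illustration of the INT8 idea only (this is not the neural-compressor API), a minimal symmetric per-tensor quantize/dequantize sketch looks like:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
err = np.max(np.abs(w - w_hat))
```

Real toolkits layer per-channel scales, calibration, and mixed precision on top of this basic scheme; the function names here are illustrative, not from any library.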
Zhiwei35's Repositories
Zhiwei35/PLCT-Open-Reports
Slides and reports from the PLCT Lab on RISC-V and MLIR
Zhiwei35/Awesome-GPU
Awesome resources for GPUs
Zhiwei35/code-samples
Source code examples from the Parallel Forall Blog
Zhiwei35/Cpp_houjie
Slides and code from Hou Jie's (侯捷) C++ courses
Zhiwei35/CPP_Optimizations_Diary
Tips and tricks to optimize your C++ code
Zhiwei35/cutlass
CUDA Templates for Linear Algebra Subroutines
Zhiwei35/DeepLearningSystem
An introduction to the core principles of deep learning systems.
Zhiwei35/flash_attention_inference
A condensed version of FlashAttention adapted for flash decoding
Zhiwei35/GPU_Microbenchmark
Zhiwei35/HPCInfo
Information about many aspects of high-performance computing. Wiki content moved to ~/docs.
Zhiwei35/IOS
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Zhiwei35/Megatron-LM
Ongoing research training transformer models at scale
Zhiwei35/modern-cpp-tutorial
📚 Modern C++ Tutorial: C++11/14/17/20 On the Fly | https://changkun.de/modern-cpp/
Zhiwei35/MyTinySTL
An implementation of STL classes in C++11
Zhiwei35/llama.cpp
LLaMA model inference in pure C/C++
Zhiwei35/LLM_final
Zhiwei35/megablocks
Zhiwei35/nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
Zhiwei35/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually exceeding Intel MKL performance, even with multithreading.
Zhiwei35/OptimizingSeriesTranslation
Chinese translation of Agner Fog's optimization series
Zhiwei35/optimum-habana
Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Zhiwei35/PaddleCustomDevice
PaddlePaddle custom device implementation (custom hardware integration for 『飞桨』 PaddlePaddle)
Zhiwei35/train-LeNet5-by-cuda
Train a LeNet-5 with CUDA
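The Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F entry above describes a stepwise DGEMM tuning sequence. A first step in such sequences is usually loop tiling for cache locality; a minimal sketch of that step (illustrative only, not code from the repository) is:

```python
import numpy as np

def matmul_blocked(A, B, bs=32):
    """Cache-blocked matrix multiply: process bs x bs tiles so the
    working set of each inner product stays resident in cache.
    This mirrors the loop-tiling step of a typical DGEMM tuning
    sequence, before vectorization (e.g. AVX-512) and threading."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                # Accumulate one tile of C from tiles of A and B.
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, p0:p0+bs] @ B[p0:p0+bs, j0:j0+bs]
                )
    return C
```

Later steps in such a sequence (register blocking, SIMD intrinsics, prefetching, multithreading) refine the same tiled structure.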