Pinned Repositories
neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
Awesome-GPU
Awesome resources for GPUs
code-samples
Source code examples from the Parallel Forall Blog
Cpp_houjie
Slides and code from Hou Jie's (侯捷) C++ courses
CPP_Optimizations_Diary
Tips and tricks to optimize your C++ code
cutlass
CUDA Templates for Linear Algebra Subroutines
flash_attention_inference
A condensed version of FlashAttention adapted for flash decoding
GPU_Microbenchmark
PLCT-Open-Reports
Slides and reports from the PLCT Lab on RISC-V and MLIR
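The neural-compressor entry above mentions low-bit quantization (INT8/FP8/INT4/FP4/NF4). As a generic illustration of the INT8 idea only (this is not the neural-compressor API), a minimal symmetric per-tensor quantize/dequantize sketch looks like:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
err = np.max(np.abs(w - w_hat))
```

Real toolkits layer per-channel scales, calibration, and mixed precision on top of this basic scheme; the function names here are illustrative, not from any library.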
Zhiwei35's Repositories
Zhiwei35/PLCT-Open-Reports
Slides and reports from the PLCT Lab on RISC-V and MLIR
Zhiwei35/Awesome-GPU
Awesome resources for GPUs
Zhiwei35/code-samples
Source code examples from the Parallel Forall Blog
Zhiwei35/Cpp_houjie
Slides and code from Hou Jie's (侯捷) C++ courses
Zhiwei35/CPP_Optimizations_Diary
Tips and tricks to optimize your C++ code
Zhiwei35/cutlass
CUDA Templates for Linear Algebra Subroutines
Zhiwei35/DeepLearningSystem
An introduction to the core principles of deep learning systems.
Zhiwei35/flash_attention_inference
A condensed version of FlashAttention adapted for flash decoding
Zhiwei35/GPU_Microbenchmark
Zhiwei35/HPCInfo
Information about many aspects of high-performance computing. Wiki content moved to ~/docs.
Zhiwei35/IOS
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Zhiwei35/Megatron-LM
Ongoing research training transformer models at scale
Zhiwei35/modern-cpp-tutorial
📚 Modern C++ Tutorial: C++11/14/17/20 On the Fly | https://changkun.de/modern-cpp/
Zhiwei35/MyTinySTL
An implementation of STL classes in C++11
Zhiwei35/llama.cpp
LLaMA model inference in pure C/C++
Zhiwei35/LLM_final
Zhiwei35/megablocks
Zhiwei35/nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
Zhiwei35/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually exceeding Intel MKL performance, even with multithreading.
Zhiwei35/OptimizingSeriesTranslation
Chinese translation of Agner Fog's optimization series
Zhiwei35/optimum-habana
Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Zhiwei35/PaddleCustomDevice
PaddlePaddle custom device implementation (custom hardware integration for 『飞桨』 PaddlePaddle)
Zhiwei35/train-LeNet5-by-cuda
Train a LeNet-5 with CUDA
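The Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F entry above describes a stepwise DGEMM tuning sequence. A first step in such sequences is usually loop tiling for cache locality; a minimal sketch of that step (illustrative only, not code from the repository) is:

```python
import numpy as np

def matmul_blocked(A, B, bs=32):
    """Cache-blocked matrix multiply: process bs x bs tiles so the
    working set of each inner product stays resident in cache.
    This mirrors the loop-tiling step of a typical DGEMM tuning
    sequence, before vectorization (e.g. AVX-512) and threading."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                # Accumulate one tile of C from tiles of A and B.
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, p0:p0+bs] @ B[p0:p0+bs, j0:j0+bs]
                )
    return C
```

Later steps in such a sequence (register blocking, SIMD intrinsics, prefetching, multithreading) refine the same tiled structure.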