Pinned Repositories
ait_learn
Learning the AITemplate codebase
AITemplate
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
awesome-tensor-compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
buy_now_script
Script for buying items from the Taobao website
ck-fa-bwd-dev
collect_perf_data
Collect performance data for CK/MISA/MIOpen to quickly create presentation sheets.
gcnasm
gpu_analyze_helper
Helper for checking a GPU kernel's shared memory usage
HIP-Performance-Optmization-on-VEGA64
14 basic topics for VEGA64 performance optimization
winograd_conv_gfx908
Development of a Winograd convolution algorithm for the gfx908 GPU
shaojiewang's Repositories
shaojiewang/ait_learn
Learning the AITemplate codebase
shaojiewang/AITemplate
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
shaojiewang/ck-fa-bwd-dev
shaojiewang/composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
shaojiewang/gcnasm
shaojiewang/cutlass
CUDA Templates for Linear Algebra Subroutines
shaojiewang/FasterTransformer
Transformer related optimization, including BERT, GPT
shaojiewang/FBGEMM
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
shaojiewang/gpu_image_processing
GPU coding practice
shaojiewang/GPUBenchmark
A performance benchmark for GPGPUs and GPU-based AI chips.
shaojiewang/hopper-gpu-inst-peak
shaojiewang/llama
Inference code for LLaMA models
shaojiewang/llm-awq
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
shaojiewang/llm.c
LLM training in simple, raw C/CUDA
shaojiewang/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
shaojiewang/Megatron-LM
Ongoing research training transformer models at scale
shaojiewang/multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
shaojiewang/onnxruntime
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
shaojiewang/Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
shaojiewang/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
shaojiewang/rccl
ROCm Communication Collectives Library (RCCL)
shaojiewang/rccl-tests
RCCL Performance Benchmark Tests
shaojiewang/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
shaojiewang/TensorRT
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
shaojiewang/tinygrad
You like pytorch? You like micrograd? You love tinygrad! ❤️
shaojiewang/torch_learn
Learning PyTorch 2.0, especially the `_dynamo`/`inductor` compilation path
shaojiewang/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
shaojiewang/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
shaojiewang/vllm-rocm
shaojiewang/vpncn.github.io
A 2021 [**] guide to recommended VPN (circumvention) software, comparing self-built VPS tunnels, SSR services, Lantern, WireGuard, V2Ray, LaoWang VPN, and other tools and methods, with [**] up-to-date recommendations for stable, reliable VPN downloads.