Pinned Repositories
ait_learn
Learning the AITemplate codebase
AITemplate
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
awesome-tensor-compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
buy_now_script
Script for buying items from the Taobao website
ck-fa-bwd-dev
collect_perf_data
Collect performance data for CK/MISA/MIOpen to quickly create presentation sheets.
gcnasm
gpu_analyze_helper
Helper for checking a GPU kernel's shared memory usage
HIP-Performance-Optmization-on-VEGA64
14 basic topics for VEGA64 performance optimization
winograd_conv_gfx908
Development of a Winograd convolution algorithm for the gfx908 GPU
shaojiewang's Repositories
shaojiewang/ait_learn
Learning the AITemplate codebase
shaojiewang/AITemplate
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
shaojiewang/ck-fa-bwd-dev
shaojiewang/composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
shaojiewang/gcnasm
shaojiewang/cutlass
CUDA Templates for Linear Algebra Subroutines
shaojiewang/FasterTransformer
Transformer related optimization, including BERT, GPT
shaojiewang/FBGEMM
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
shaojiewang/gpu_image_processing
GPU coding practice
shaojiewang/GPUBenchmark
A performance benchmark for GPGPUs and GPU-based AI chips.
shaojiewang/hopper-gpu-inst-peak
shaojiewang/llama
Inference code for LLaMA models
shaojiewang/llm-awq
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
shaojiewang/llm.c
LLM training in simple, raw C/CUDA
shaojiewang/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
shaojiewang/Megatron-LM
Ongoing research training transformer models at scale
shaojiewang/multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
shaojiewang/onnxruntime
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
shaojiewang/Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
shaojiewang/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
shaojiewang/rccl
ROCm Communication Collectives Library (RCCL)
shaojiewang/rccl-tests
RCCL Performance Benchmark Tests
shaojiewang/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
shaojiewang/TensorRT
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
shaojiewang/tinygrad
You like pytorch? You like micrograd? You love tinygrad! ❤️
shaojiewang/torch_learn
Learning PyTorch 2.0, especially the `_dynamo`/`inductor` compilation path
shaojiewang/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
shaojiewang/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
shaojiewang/vllm-rocm
shaojiewang/vpncn.github.io
A 2021 [**] guide to recommended VPN (circumvention) software, comparing self-built VPS tunnels, SSR services, Lantern, WireGuard, V2Ray, LaoWang VPN, and other tools and methods, with [**] up-to-date recommendations for stable, reliable VPN downloads.