Pinned Repositories
llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
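A minimal sketch of the activation-aware idea, not the repo's implementation (AWQ actually searches over per-layer scaling factors and fuses the inverse scales into the preceding operator); the fixed `alpha` and tensor names here are illustrative:

```python
import torch

def awq_scale_sketch(w, x_sample, alpha=0.5, n_bits=4, group_size=128):
    """Scale weight columns by activation magnitude before group-wise
    quantization, so channels that see large activations lose less precision.
    w: (out_features, in_features), x_sample: (tokens, in_features)."""
    act_scale = x_sample.abs().mean(dim=0)              # per-channel salience proxy
    s = act_scale.clamp(min=1e-5) ** alpha              # AWQ searches over this exponent
    w_scaled = w * s                                    # protect salient columns
    # Group-wise symmetric quantization (assumes in_features % group_size == 0).
    wg = w_scaled.reshape(w.shape[0], -1, group_size)
    q_max = 2 ** (n_bits - 1) - 1
    scale = (wg.abs().amax(dim=-1, keepdim=True) / q_max).clamp(min=1e-8)
    w_q = (wg / scale).round().clamp(-q_max - 1, q_max) * scale
    return w_q.reshape_as(w) / s                        # fold the scales back out

w_deq = awq_scale_sketch(torch.randn(256, 512), torch.randn(64, 512))
```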
TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, for better performance and lower memory utilization in both training and inference.
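A minimal FP8 usage sketch following the library's quickstart pattern (requires a Hopper or Ada GPU; the recipe arguments are illustrative defaults):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

linear = te.Linear(768, 3072, bias=True)                     # drop-in for torch.nn.Linear
inp = torch.randn(2048, 768, device="cuda")
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

# Matmuls inside this context run in FP8 with delayed, amax-based scaling.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = linear(inp)
out.sum().backward()
```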
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
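A sketch of the quantization flow following the pattern in the package's README (the model id and single calibration example are placeholders; real runs use a proper calibration set):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"                              # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
examples = [tokenizer("auto-gptq is an easy-to-use LLM quantization package.",
                      return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)                                    # runs GPTQ layer by layer
model.save_quantized("opt-125m-4bit")                       # quantized weights + config
```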
grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
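A sketch of a typical MoE-style invocation; the `gg.ops.gmm` entry point and argument layout below are assumptions based on the repo's README:

```python
import torch
import grouped_gemm as gg

num_experts, k, n = 4, 512, 1024
batch_sizes = torch.tensor([3, 1, 5, 7])                    # tokens routed per expert (CPU int64)
a = torch.randn(int(batch_sizes.sum()), k, device="cuda", dtype=torch.bfloat16)
b = torch.randn(num_experts, k, n, device="cuda", dtype=torch.bfloat16)

# One grouped kernel computes every per-expert GEMM: rows of `a` are split
# by batch_sizes, and row block i is multiplied by b[i].
out = gg.ops.gmm(a, b, batch_sizes)                         # (sum(batch_sizes), n)
```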
marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
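The ~4x figure comes from the memory-bound regime: at small batch sizes the GEMM is limited by weight traffic, and 4-bit weights move a quarter of the bytes of FP16. A reference sketch of the FP16xINT4 semantics with group-wise scales (not the kernel's actual code, which operates on packed weights):

```python
import torch

def fp16xint4_reference(x, w_int4, scales, group_size=128):
    """x: (m, k) FP16; w_int4: (k, n) ints in [-8, 7] (packed in the real kernel);
    scales: (k // group_size, n) FP16 group-wise scales."""
    s = scales.repeat_interleave(group_size, dim=0)         # (k, n): one scale per group
    w_fp16 = w_int4.to(torch.float16) * s                   # dequantize
    return (x.float() @ w_fp16.float()).half()              # FP32 math so this runs on CPU

m, k, n = 16, 512, 1024
y = fp16xint4_reference(torch.randn(m, k, dtype=torch.float16),
                        torch.randint(-8, 8, (k, n)),
                        torch.rand(k // 128, n, dtype=torch.float16))
```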
nvidia-modelopt
QuaRot
Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.
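The core trick is computational invariance: rotating activations by an orthogonal matrix and folding the inverse rotation into the weights leaves the output unchanged while spreading outliers across channels, which is what makes 4-bit quantization of both weights and activations viable. A minimal demonstration with a random orthogonal matrix (QuaRot itself uses randomized Hadamard transforms):

```python
import torch

k = 512
w = torch.randn(k, k, dtype=torch.float64)                  # layer weight
x = torch.randn(8, k, dtype=torch.float64)
x[:, 0] *= 50.0                                             # an outlier channel

q, _ = torch.linalg.qr(torch.randn(k, k, dtype=torch.float64))  # random orthogonal Q

x_rot = x @ q                                               # rotate activations
w_rot = q.T @ w                                             # fold Q^T into the weight
assert torch.allclose(x @ w, x_rot @ w_rot)                 # output is unchanged

# The rotation spreads the outlier's energy, shrinking the dynamic
# range a per-tensor quantizer has to cover.
print(x.abs().max() / x.abs().mean(), x_rot.abs().max() / x_rot.abs().mean())
```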
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
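A sketch using the high-level `LLM` API from recent releases (the checkpoint is a placeholder; older versions instead build an engine explicitly with `trtllm-build` and run it through a runtime session):

```python
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine under the hood, then serves it in-process.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")       # placeholder checkpoint
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["What is activation-aware quantization?"], params):
    print(out.outputs[0].text)
```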
tensorrtllm_backend
The Triton TensorRT-LLM Backend
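Once an engine is deployed under Triton, the backend is reachable through Triton's generate endpoint; a client sketch (the server address and the `ensemble` model name depend on the deployment):

```python
import requests

url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {"text_input": "What is TensorRT-LLM?", "max_tokens": 64}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])
```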
StudyingShao's Repositories
StudyingShao/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
StudyingShao/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
StudyingShao/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
StudyingShao/marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
StudyingShao/nvidia-modelopt
StudyingShao/QuaRot
Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.
StudyingShao/tensorrtllm_backend
The Triton TensorRT-LLM Backend
StudyingShao/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, for better performance and lower memory utilization in both training and inference.