Summer-Summer
Machine Learning Systems & Software-Hardware Co-design
University of Sydney, Sydney NSW, Australia
Pinned Repositories
flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
ALCFBeginnersGuide
bitsandbytes
8-bit CUDA functions for PyTorch
ComputerArchitectureLab
This repository is used to release the Labs of Computer Architecture Course from USTC
cutlass-kernels
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
llm-awq
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
README
An explanation of README file syntax, i.e., an introduction to GitHub Flavored Markdown
fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
Summer-Summer's Repositories
Summer-Summer/ComputerArchitectureLab
This repository is used to release the Labs of Computer Architecture Course from USTC
Summer-Summer/README
An explanation of README file syntax, i.e., an introduction to GitHub Flavored Markdown
Summer-Summer/ALCFBeginnersGuide
Summer-Summer/bitsandbytes
8-bit CUDA functions for PyTorch
Summer-Summer/cutlass-kernels
Summer-Summer/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Summer-Summer/flash-llm
Summer-Summer/llm-awq
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Summer-Summer/Master-s-Thesis
Summer-Summer/nersc-roofline
Summer-Summer/SparTA
Summer-Summer/sputnik
A library of GPU kernels for sparse matrix operations.
Summer-Summer/vectorSparse
Summer-Summer/quant-matmul
Summer-Summer/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.