saberlililily's Stars
microsoft/BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
neuralmagic/AutoFP8
Azure/MS-AMP
Microsoft Automatic Mixed Precision Library
OpenPPL/ppq
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
usyd-fsalab/fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
lucidrains/vit-pytorch
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch
project-numina/aimo-progress-prize
NVIDIA/TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
DD-DuDa/TensorRT-in-Action
TensorRT-in-Action is a GitHub repository providing code examples for using TensorRT, with accompanying Jupyter Notebooks.
NVIDIA/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
IntelLabs/FP8-Emulation-Toolkit
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
DefTruth/CUDA-Learn-Notes
📚 150+ Tensor/CUDA Core kernels: ⚡️ flash-attention-mma, ⚡️ hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS TFLOPS).
aredden/flux-fp8-api
Flux diffusion model implementation using quantized FP8 matmul; the remaining layers use faster half-precision accumulation, making it ~2x faster on consumer devices.
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
itayhubara/BinaryNet.pytorch
Binarized Neural Network (BNN) implementation for PyTorch
cooooorn/Pytorch-XNOR-Net
XNOR-Net, with binary GEMM and binary conv2d kernels; supports both CPU and GPU.
jiecaoyu/XNOR-Net-PyTorch
PyTorch Implementation of XNOR-Net
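As context for these XNOR-Net and BNN repositories, here is a minimal sketch (mine, not code from any of these projects) of the XNOR-popcount trick that binary GEMM kernels build on: with weights and activations constrained to {-1, +1} and bit-packed (1 -> +1, 0 -> -1), a length-n dot product reduces to n - 2 * popcount(a XOR b). The helper name binary_dot is hypothetical.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors of length n, packed as bits (1 -> +1, 0 -> -1)."""
    mask = (1 << n) - 1
    # One XOR plus one population count replaces n multiply-accumulates.
    return n - 2 * bin((a_bits ^ b_bits) & mask).count("1")

# Example: a = 0b1011 -> [+1, +1, -1, +1] (LSB first), b = 0b1101 -> [+1, -1, +1, +1]
assert binary_dot(0b1011, 0b1101, 4) == 0  # (+1) + (-1) + (-1) + (+1) = 0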
peiswang/SiBNN
awai54st/LUTNet
WangXuan95/BSV_Tutorial_cn
A comprehensive Chinese-language tutorial on Bluespec SystemVerilog (BSV), covering advanced features such as scheduling, FIFO dataflow, and polymorphism, and demonstrating BSV's advantages over traditional Verilog development.
rayleizhu/vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
kyspyridon/FP_Adder
A single-cycle and a 2-stage pipelined floating-point adder designed in Verilog according to the IEEE-754 format. The project targets a Xilinx Zedboard; detachable 7-segment displays were used to test the implementation on the actual hardware.
shahsaumya00/Floating-Point-Adder
32-bit pipelined binary floating-point adder using the IEEE-754 single-precision format, in Verilog
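For context on these IEEE-754 adder projects, a minimal Python sketch (not taken from either repository) of the single-precision field split such an adder works with: 1 sign bit, 8 exponent bits (bias 127), and 23 fraction bits; the adder aligns the smaller operand's significand by the exponent difference before adding, then renormalizes and rounds. The helper name fp32_fields is hypothetical.

import struct

def fp32_fields(x: float):
    """Decode an IEEE-754 single-precision value into (sign, biased exponent, significand)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # biased exponent (bias 127)
    fraction = bits & 0x7FFFFF              # 23-bit fraction
    significand = (1 << 23) | fraction      # implicit leading 1 for normal numbers
    return sign, exponent, significand

# Example: 1.5 is encoded as 0x3FC00000
print(fp32_fields(1.5))  # (0, 127, 12582912), i.e. significand 0xC00000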
erihsu/INT_FP_MAC
INT8 & FP16 multiplier accumulator (MAC) design with UVM verification completed.
DongbeomSon/fp16MAC
JulianKemmerer/PipelineC
A C-like hardware description language (HDL) adding high-level synthesis (HLS)-like automatic pipelining as a language construct/compiler feature.
google/xls
XLS: Accelerated HW Synthesis
dawsonjon/fpu
Synthesisable IEEE-754 floating-point library in Verilog
robfinch/Float
Floating-point code in SystemVerilog
openhwgroup/cvfpu
Parametric floating-point unit with support for standard RISC-V formats and operations as well as transprecision formats.