Miroier's Stars
thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
tlc-pack/libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
google/bloaty
Bloaty: a size profiler for binaries
srush/Triton-Puzzles
Puzzles for learning Triton
ZonePG/cs-notes
my cs notes
tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
bytedance/ABQ-LLM
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
FlagOpen/FlagGems
FlagGems is an operator library for large language models implemented in Triton Language.
bytedance/flux
A fast communication-overlapping library for tensor parallelism on GPUs.
kendryte/nncase
Open deep learning compiler stack for Kendryte AI accelerators ✨
feifeibear/LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
KnowingNothing/compiler-and-arch
A list of tutorials, papers, talks, and open-source projects for emerging compilers and architectures
TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
Sergei-Korneev/obsidian-local-images-plus
This repo is a reincarnation of the obsidian-local-images plugin, whose main aim was downloading images in md notes to local storage.
FoundationVision/VAR
[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
xdit-project/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
FlagOpen/FlagScale
FlagScale is a large model toolkit based on open-sourced projects.
online-judge-tools/verification-helper
a testing framework for snippet libraries used in competitive programming
wting/autojump
A cd command that learns - easily navigate directories from the command line
alangrainger/share-note
Instantly share an Obsidian note with the full theme exactly like you see in your vault. Data is shared encrypted by default, and only you and the person you send it to have the key.
pengsida/learning_research
My research experience
Jack47/hack-SysML
The road to hack SysML and become a system expert
weishengying/cutlass_flash_atten_fp8
Implements fp8 flash attention on the Ada architecture using the cutlass repository
CaiJimmy/hugo-theme-stack
Card-style Hugo theme designed for bloggers
ELS-RD/kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
ROCm/rocWMMA
rocWMMA
AnswerDotAI/gpu.cpp
A lightweight library for portable low-level GPU computation using WebGPU.
te42kyfo/gpu-benches
collection of benchmarks to measure basic GPU capabilities
sjfeng1999/gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture