Miroier

Miroier's Stars

thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Language:Cuda67233
tlc-pack/libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
Language:C++9913
google/bloaty
Bloaty: a size profiler for binaries
Language: C++ · 4.8k · 348
srush/Triton-Puzzles
Puzzles for learning Triton
Language: Jupyter Notebook · 1.2k · 92
ZonePG/cs-notes
my cs notes
Language:Jupyter Notebook272
tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
Language:Cuda5740
bytedance/ABQ-LLM
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
Language:C++22925
FlagOpen/FlagGems
FlagGems is an operator library for large language models implemented in Triton Language.
Language:Python35852
bytedance/flux
A fast communication-overlapping library for tensor parallelism on GPUs.
Language:C++23719
kendryte/nncase
Open deep learning compiler stack for Kendryte AI accelerators ✨
Language:C#756185
feifeibear/LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
Language:Python60063
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
Language: Cuda · 1.7k · 78
KnowingNothing/compiler-and-arch
A list of tutorials, papers, talks, and open-source projects for emerging compilers and architectures
40135
TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
Language:C++16710
Sergei-Korneev/obsidian-local-images-plus
This repo is a reincarnation of the obsidian-local-images plugin, whose main aim was downloading images in md notes to local storage.
Language:TypeScript27021
FoundationVision/VAR
[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
Language: Python · 6.1k · 407
xdit-project/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
Language:Python91679
FlagOpen/FlagScale
FlagScale is a large model toolkit based on open-sourced projects.
Language:Python19449
online-judge-tools/verification-helper
a testing framework for snippet libraries used in competitive programming
Language:Python23356
wting/autojump
A cd command that learns - easily navigate directories from the command line
Language: Python · 16.3k · 705
alangrainger/share-note
Instantly share an Obsidian note with the full theme exactly like you see in your vault. Data is shared encrypted by default, and only you and the person you send it to have the key.
Language:TypeScript33717
pengsida/learning_research
My research experience
6.1k · 358
Jack47/hack-SysML
The road to hack SysML and become a system expert
Language:Emacs Lisp44754
weishengying/cutlass_flash_atten_fp8
Implements fp8 flash attention on the Ada architecture using the cutlass repository
Language:Cuda533
CaiJimmy/hugo-theme-stack
Card-style Hugo theme designed for bloggers
Language: HTML · 5.1k · 1.7k
ELS-RD/kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
Language: Jupyter Notebook · 1.5k · 94
ROCm/rocWMMA
rocWMMA
Language:C++9426
AnswerDotAI/gpu.cpp
A lightweight library for portable low-level GPU computation using WebGPU.
Language: C++ · 3.8k · 177
te42kyfo/gpu-benches
collection of benchmarks to measure basic GPU capabilities
Language:Jupyter Notebook26441
sjfeng1999/gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
Language:Cuda8225