qelk123's Stars
microsoft/triton-shared
Shared Middle-Layer for Triton Compilation
TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
krahets/hello-algo
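Hello Algo: a data structures and algorithms tutorial with animated illustrations and one-click runnable code. Supports Python, Java, C++, C, C#, JS, Go, Swift, Rust, Ruby, Kotlin, TS, and Dart. The Simplified and Traditional Chinese editions are updated in sync; the English version is in progress.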
eillsu/iTerm2-Chinese-Tutorial
iTerm2 tutorial in Chinese
KnowingNothing/MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
epfml/dynamic-sparse-flash-attention
Liu-xiandong/How_to_optimize_in_GPU
A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm. The performance of these kernels is essentially at or near the theoretical limit.
karpathy/llm.c
LLM training in simple, raw C/CUDA
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
sjfeng1999/gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
j2kun/mlir-tutorial
MLIR For Beginners tutorial
llvm/torch-mlir
The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.
bytedance/byteir
A model compilation solution for various hardware
meta-llama/llama
Inference code for Llama models
NVlabs/NVBit
zwang4/awesome-machine-learning-in-compilers
Must read research papers and links to tools and datasets that are related to using machine learning for compilers and systems optimisation
UniHD-CEG/cuda-flux
CUDA Flux is a profiler for GPU applications which reports the basic block execution frequencies of compute kernels
hpc-ulisboa/gpuPTXModel
GPU Static Modeling using PTX and Deep Structured Learning
lanl/PPT
Performance Prediction Toolkit
UniHD-CEG/gpu-mangrove
Machine learning model for execution time and power prediction of CUDA kernels
sderek/CUDAAdvisor
CUDAAdvisor: a GPU profiling tool
gpgpu-sim/gpgpu-sim_distribution
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.
NVIDIA/cccl
CUDA Core Compute Libraries
gem5/gem5
The official repository for the gem5 computer-system architecture simulator.
arrayfire/arrayfire
ArrayFire: a general purpose GPU library.
flexflow/FlexFlow
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
NervanaSystems/maxas
Assembler for NVIDIA Maxwell architecture
SuperScientificSoftwareLaboratory/DASP
Source code of the SC '23 paper: "DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication" by Yuechen Lu and Weifeng Liu.
kokkos/kokkos
Kokkos C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
LC044/WeChatMsg
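Extracts WeChat chat history and exports it to HTML, Word, and Excel documents for permanent storage; analyzes the chat history to generate an annual chat report; and uses chat data to train a personal AI chat assistant.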