Pinned Repositories
Awesome-GPU
Awesome resources for GPUs
cmake-examples
Useful CMake Examples
CUDA-PPT
cute-gemm
LearnDLSysCourse
Learning_CUDA
MadMario-OneFlow
oneflow
OneFlow is a performance-centered and open-source deep learning framework.
paper_reading
Tools
A collection of useful code.
MARD1NO's Repositories
MARD1NO/CUDA-PPT
MARD1NO/OneshotAllreduceExample
MARD1NO/open-resume
OpenResume is a powerful open-source resume builder and resume parser. https://open-resume.com/
MARD1NO/tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
MARD1NO/cutlass_master
CUDA Templates for Linear Algebra Subroutines
MARD1NO/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
MARD1NO/Awesome-LLM-System-Papers
MARD1NO/ByteTransformer
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
MARD1NO/CUDALibrarySamples
CUDA Library Samples
MARD1NO/docs
Documentation for PaddlePaddle
MARD1NO/dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
MARD1NO/EdgeGPT
Reverse engineered API of Microsoft's Bing Chat AI
MARD1NO/FlexGen
Running large language models like OPT-175B/GPT-3 on a single GPU. Up to 100x faster than other offloading systems.
MARD1NO/Fuser
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
MARD1NO/GPTQ-triton
GPTQ inference Triton kernel
MARD1NO/InferLLM
A lightweight LLM inference framework
MARD1NO/INT8-Flash-Attention-FMHA-Quantization
MARD1NO/kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
MARD1NO/LLMsPracticalGuide
MARD1NO/LLMSurvey
A collection of papers and resources related to Large Language Models.
MARD1NO/nanoPyC
MARD1NO/nccl-tests
NCCL Tests
MARD1NO/ppl.kernel.cuda
MARD1NO/PTX-ISA
Chinese translation of the CUDA PTX-ISA document
MARD1NO/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
MARD1NO/taichi-nerfs
Implementations of NeRF variants based on Taichi + PyTorch
MARD1NO/tiktoken
MARD1NO/triton
Development repository for the Triton language and compiler
MARD1NO/typst
A new markup-based typesetting system that is powerful and easy to learn.
MARD1NO/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs