Pinned Repositories
Awesome-GPU
Awesome resources for GPUs
cmake-examples
Useful CMake Examples
CUDA-PPT
cute-gemm
LearnDLSysCourse
Learning_CUDA
MadMario-OneFlow
oneflow
OneFlow is a performance-centered and open-source deep learning framework.
paper_reading
Tools
A collection of useful code.
MARD1NO's Repositories
MARD1NO/CUDA-PPT
MARD1NO/Tools
A collection of useful code.
MARD1NO/cute-gemm
MARD1NO/Loser-HomeWork
Losers' homework: showcase and answer explanations
MARD1NO/open-resume
OpenResume is a powerful open-source resume builder and resume parser. https://open-resume.com/
MARD1NO/tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
MARD1NO/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
MARD1NO/causal-conv1d
Causal depthwise conv1d in CUDA, with a PyTorch interface
MARD1NO/Cutlass_EX
study of cutlass
MARD1NO/dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
MARD1NO/excalidraw
Virtual whiteboard for sketching hand-drawn like diagrams
MARD1NO/flash_attention_inference
Performance of the C++ interface of flash attention, flash attention v2 and self decoding attention in large language model (LLM) inference scenarios.
MARD1NO/HighPerformance
Cpp HighPerformance
MARD1NO/InferLLM
A lightweight LLM inference framework
MARD1NO/kunlun.cpp
MARD1NO/LLMsPracticalGuide
MARD1NO/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
MARD1NO/LookaheadDecoding
MARD1NO/MARD1NO.github.io
MARD1NO/Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
MARD1NO/mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
MARD1NO/nanobind
nanobind: tiny and efficient C++/Python bindings
MARD1NO/Paddle
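PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)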
MARD1NO/PaddleNLP
👑 Easy-to-use and powerful NLP library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis and 🖼 Diffusion AIGC system etc.
MARD1NO/ppl.llm.kernel.cuda
MARD1NO/punica
MARD1NO/S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
MARD1NO/simplified_transformers
MARD1NO/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
MARD1NO/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs