caoxiang520's Stars
angry-crab/cudla_dev
ifromeast/cuda_learning
Learning how CUDA works.
leimao/CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
luliyucoordinate/CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
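Both GEMM repos above measure their optimizations against the same naive baseline: one thread per output element, reading everything from global memory. For orientation, a minimal sketch of that baseline (the code below is illustrative, not taken from either repo):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Naive SGEMM: C = A * B, with A (M x K), B (K x N), C (M x N),
// all row-major. One thread computes one element of C.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

int main() {
    const int M = 256, N = 256, K = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0f;

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(M, N, K, A, B, C);
    cudaDeviceSynchronize();

    printf("C[0] = %f (expected %f)\n", C[0], 2.0f * K);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

The typical optimization steps from here are shared-memory tiling and per-thread register blocking to cut global-memory traffic.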
luliyucoordinate/cute-flash-attention
Implements Flash Attention using CuTe.
tmlfrkn/CUDAIntegratedTransformerTool
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
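Much of vLLM's memory efficiency comes from PagedAttention, which stores the KV cache in fixed-size blocks and reaches them through a per-sequence block table rather than one contiguous buffer per request. A toy sketch of that indirection follows; the block size, layout, and all names are illustrative assumptions, not vLLM's actual kernels:

```cuda
// Toy paged-KV lookup (illustrative, not vLLM's real layout).
// The cache is split into physical blocks of BLOCK_SIZE tokens;
// block_table[] maps a sequence's logical block index to a physical block.
#define BLOCK_SIZE 16
#define HEAD_DIM 64

__device__ const float *lookup_key(
        const float *kv_blocks,   // [num_blocks][BLOCK_SIZE][HEAD_DIM]
        const int *block_table,   // logical block index -> physical block id
        int token_pos) {
    int physical_block  = block_table[token_pos / BLOCK_SIZE];
    int offset_in_block = token_pos % BLOCK_SIZE;
    return kv_blocks
         + ((size_t)physical_block * BLOCK_SIZE + offset_in_block) * HEAD_DIM;
}

// Score one query against one cached key via the block table.
__global__ void score_one_token(const float *kv_blocks, const int *block_table,
                                const float *q, int token_pos, float *out) {
    const float *k = lookup_key(kv_blocks, block_table, token_pos);
    float dot = 0.0f;
    for (int d = 0; d < HEAD_DIM; ++d) dot += q[d] * k[d];
    *out = dot;
}
```

Because blocks can live anywhere in GPU memory, sequences of very different lengths can share one pool without fragmentation.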
bytedance/lightseq
LightSeq: A High Performance Library for Sequence Processing and Generation
megvii-research/MOTRv2
[CVPR2023] MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors
megvii-research/MOTR
[ECCV2022] MOTR: End-to-End Multiple-Object Tracking with TRansformer
ZikangZhou/HiVT
[CVPR 2022] HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction
OpenDriveLab/UniAD
[CVPR 2023 Best Paper Award] Planning-oriented Autonomous Driving
zjhellofss/KuiperInfer
A great project for campus recruiting (autumn/spring hiring) and internships! Implements a high-performance deep learning inference library from scratch, step by step, with support for models including Llama 2, UNet, YOLOv5, and ResNet.
zjhellofss/KuiperLLama
A great project for campus recruiting (autumn/spring hiring) and internships: build a large-model inference framework from scratch that supports Llama 2/3 and Qwen 2.5.
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
bytedance/byteir
A model compilation solution for various hardware
InfiniTensor/InfiniTensor
Oneflow-Inc/trt_flash_attention
Oneflow-Inc/flash-attention-v2
Fast and memory-efficient exact attention
NVIDIA/FasterTransformer
Transformer-related optimizations, including BERT and GPT.
linClubs/BEVDet-ROS-TensorRT
BEVDet online real-time inference using CUDA, TensorRT, ROS1 & C++.
maggiez0138/Swin-Transformer-TensorRT
This project aims to explore the deployment of Swin-Transformer based on TensorRT, including the test results of FP16 and INT8.
OpenPPL/ppl.nn
A primitive library for neural networks.
owenliang/pytorch-transformer
A PyTorch re-implementation of the Transformer.
awagner8/Kvcache
laugh12321/TensorRT-YOLO
TensorRT-YOLO: A high-performance, easy-to-use YOLO deployment toolkit for NVIDIA, powered by TensorRT plugins and CUDA Graph, supporting C++ and Python.
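The CUDA Graph this toolkit relies on is a standard CUDA runtime feature: a stream's launch sequence is captured once and then replayed with a single call, amortizing per-kernel CPU launch overhead across a whole pipeline. A minimal capture-and-replay sketch using the stock runtime API (the scale kernel is a stand-in):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the launch sequence into a graph instead of executing it.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(x, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(x, n, 0.5f);
    cudaStreamEndCapture(stream, &graph);

    // CUDA 12 signature; CUDA 11 uses a five-argument overload instead.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);

    // Replay the whole recorded sequence with one launch per iteration.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(x);
    return 0;
}
```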
ShaYeBuHui01/flash_attention_inference
Benchmarks the performance of the C++ interfaces of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
chainyo/transformers-pipeline-onnx
How to export Hugging Face's 🤗 NLP Transformers models to ONNX and use the exported model with the appropriate Transformers pipeline.
66RING/tiny-flash-attention
A flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.
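The flash attention entries in this list all revolve around the same online-softmax recurrence: scores are consumed as a stream while a running max and running sum keep the normalization numerically stable, so the full attention score matrix is never materialized. A stripped-down sketch of that recurrence, with one thread per query row and none of the tiling or shared-memory staging the real kernels use (all names illustrative):

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define D 64  // head dimension

// Single-head attention, one thread per query row. Each thread streams
// over the keys, maintaining a running max (m) and running sum (l) so
// softmax is computed online without storing the full score row.
__global__ void attention_online_softmax(const float *Q, const float *K,
                                         const float *V, float *O,
                                         int num_q, int num_kv) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= num_q) return;

    float m = -INFINITY;      // running max of scores seen so far
    float l = 0.0f;           // running sum of exp(score - m)
    float acc[D] = {0.0f};    // unnormalized weighted sum of V rows
    const float scale = rsqrtf((float)D);

    for (int k = 0; k < num_kv; ++k) {
        float s = 0.0f;
        for (int d = 0; d < D; ++d)
            s += Q[q * D + d] * K[k * D + d];
        s *= scale;

        float m_new = fmaxf(m, s);
        float correction = __expf(m - m_new);  // rescale earlier partials
        float p = __expf(s - m_new);
        l = l * correction + p;
        for (int d = 0; d < D; ++d)
            acc[d] = acc[d] * correction + p * V[k * D + d];
        m = m_new;
    }
    for (int d = 0; d < D; ++d)
        O[q * D + d] = acc[d] / l;
}
```

The production kernels add key/value tiling through shared memory and warp-level parallelism on top of this recurrence; the recurrence itself is what makes the single pass possible.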
Oneflow-Inc/flash-attention
Fast and memory-efficient exact attention