cumtchw's Stars
deepseek-ai/DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Tony-Tan/CUDA_Freshman
ShadyBoukhary/GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern, interdisciplinary, and valuable teaching practice that plays a critical role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two widely adopted interdisciplinary scientific kernels, a Fast Fourier Transform and Matrix Multiplication. Both routines are implemented in the two currently most popular many-core programming models, CUDA and OpenACC. A Fast Fourier Transform (FFT) samples a signal over a period of time and divides it into its frequency components, computing the Discrete Fourier Transform (DFT) of a sequence. Unlike the traditional approach to computing a DFT, FFT algorithms reduce the complexity of the problem from O(n²) to O(n log₂ n). Matrix multiplication is a cornerstone routine in Mathematics, Artificial Intelligence, and Machine Learning. This research also shows that the nature of the problem plays a crucial role in determining which many-core model will provide the highest benefit in performance.
airockchip/ultralytics_yolov8
NEW - YOLOv8 🚀 in PyTorch > ONNX > CoreML > TFLite
airockchip/rknn_model_zoo
a-hamdi/GPU
100 days of building GPU kernels!
Tongkaio/CUDA_Kernel_Samples
Hand-written CUDA operators and an interview guide
Maharshi-Pandya/cudacodes
Learnings and programs related to CUDA
Open-LLM-VTuber/Open-LLM-VTuber
Talk to any LLM with hands-free voice interaction, voice interruption, and a Live2D talking face, running locally across platforms
ifzhang/ByteTrack
[ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box
ultralytics/ultralytics
Ultralytics YOLO11 🚀
ireader/media-server
RTSP/RTP/RTMP/FLV/HLS/MPEG-TS/MPEG-PS/MPEG-DASH/MP4/fMP4/MKV/WebM
ZLMediaKit/ZLMediaKit
WebRTC/RTSP/RTMP/HTTP/HLS/HTTP-FLV/WebSocket-FLV/HTTP-TS/HTTP-fMP4/WebSocket-TS/WebSocket-fMP4/GB28181/SRT server and client framework based on C++11
gelldur/EventBus
A lightweight and very fast event bus / event framework for C++17
hiyouga/LLaMA-Factory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
NVIDIA/workbench-llamafactory
This is an NVIDIA AI Workbench example project that demonstrates an end-to-end model development workflow using Llamafactory.
NVIDIA/cuda-samples
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
Cambricon/CNStream
CNStream is a streaming framework for building Cambricon machine learning pipelines. http://forum.cambricon.com https://gitee.com/SolutionSDK/CNStream
brucefan1983/CUDA-Programming
Sample codes for my CUDA programming book
progschj/ThreadPool
A simple C++11 Thread Pool implementation
cumtchw/MemoryPool
An advanced C++ memory pool implementation, including a detailed code walkthrough, a CMake build project, and usage examples.
cacay/MemoryPool
An easy to use and efficient memory pool allocator written in C++.
wispytrace/magik-toolkit
open-webui/open-webui
User-friendly AI Interface (Supports Ollama, OpenAI API, ...)
MarkFzp/act-plus-plus
Imitation learning algorithms with Co-training for Mobile ALOHA: ACT, Diffusion Policy, VINN
MarkFzp/mobile-aloha
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
weaigc/bingo
Bingo, a New Bing that lets you breathe easy.
nxp-imx/uboot-imx
i.MX U-Boot
diffgram/diffgram
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
microsoft/onnxruntime
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator