Pinned Repositories
Awesome-GPU
Awesome resources for GPUs
cmake-examples
Useful CMake Examples
CUDA-PPT
cute-gemm
LearnDLSysCourse
Learning_CUDA
MadMario-OneFlow
oneflow
OneFlow is a performance-centered and open-source deep learning framework.
paper_reading
Tools
Collect some useful code.
MARD1NO's Repositories
MARD1NO/Learning_CUDA
MARD1NO/paper_reading
MARD1NO/LearningTemplates
MARD1NO/Note
MARD1NO/NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
MARD1NO/AI-System
Education resources for the System for AI course.
MARD1NO/baidu-allreduce
MARD1NO/ConcurrencyInAction
MARD1NO/Cpp-Concurrency-in-Action-2ed
C++11/14/17/20 multithreading, covering operating-system principles and concurrent programming techniques.
MARD1NO/cuda-samples
Samples for CUDA developers demonstrating features in the CUDA Toolkit.
MARD1NO/cuda-training-series
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
MARD1NO/data
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
MARD1NO/DATASTRUCT
MARD1NO/DeepRec
DeepRec is a recommendation engine based on TensorFlow.
MARD1NO/DesignPattern
A full implementation of the 23 design patterns in C++11.
MARD1NO/ECE408
MARD1NO/fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch providing a process group for distributed training that outperforms NCCL at large scale.
MARD1NO/FBTT-Embedding
This is a Tensor-Train-based compression library for the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. It can reduce total model size by up to 100x in Facebook's open-sourced DLRM model while achieving the same model quality, and the implementation is faster than state-of-the-art alternatives. Existing state-of-the-art libraries decompress whole embedding tables on the fly, so they provide no memory reduction during training. This library decompresses only the requested rows, which can reduce the memory footprint per embedding table by up to 10,000x. It also includes a software cache that keeps a portion of the table entries in decompressed form for faster lookup.
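The row-wise decompression idea above can be sketched in NumPy: the embedding table exists only as tensor-train cores, and a single row is reconstructed on demand without ever materializing the full table. This is a toy under assumed shapes (a 3-core split, names like `decompress_row`), not the library's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table with 10*10*10 = 1000 rows and 2*3*4 = 24 embedding dims,
# stored only as three tensor-train (TT) cores -- the full table is never kept.
row_dims, emb_dims, ranks = (10, 10, 10), (2, 3, 4), (1, 8, 8, 1)
cores = [rng.standard_normal((ranks[k], row_dims[k], emb_dims[k], ranks[k + 1]))
         for k in range(3)]

def decompress_row(cores, idx):
    """Reconstruct a single embedding row by contracting one slice per core."""
    i = np.unravel_index(idx, row_dims)   # split flat row index into (i0, i1, i2)
    out = np.ones((1, 1))                 # running result, shape (emb_so_far, rank)
    for k, core in enumerate(cores):
        G = core[:, i[k]]                 # slice for this sub-index: (r_k, m_k, r_{k+1})
        out = np.einsum('er,rms->ems', out, G).reshape(-1, G.shape[-1])
    return out.ravel()                    # final embedding row, shape (24,)

row = decompress_row(cores, 123)          # only this one row is ever materialized
```

Storage here is 3 small cores instead of a 1000x24 dense matrix; the per-row cost is a few tiny matrix contractions, which is what makes a decompressed-row software cache worthwhile.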
MARD1NO/flash-attention
MARD1NO/Learning_compile
MARD1NO/LearnRust
MARD1NO/MARD1NO
MARD1NO/MetaNN
MARD1NO/MIT6.S081
MARD1NO/openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
MARD1NO/powersgd
Practical low-rank gradient compression for distributed optimization: https://arxiv.org/abs/1905.13727
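The low-rank compression in the linked paper can be sketched in NumPy as one step of subspace iteration on a gradient matrix. This single-process toy omits the all-reduce of the factors and the error-feedback buffer the real algorithm uses; the function name `powersgd_step` is illustrative, not the repo's API.

```python
import numpy as np

def powersgd_step(M, Q):
    """One PowerSGD-style step: compress gradient M into low-rank factors (P, Q)."""
    P = M @ Q                    # project the gradient onto the current subspace
    P, _ = np.linalg.qr(P)       # orthogonalize (one step of power iteration)
    Q = M.T @ P                  # update the right factor; workers would all-reduce P and Q
    return P, Q                  # the decompressed gradient is P @ Q.T

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))  # an exactly rank-4 "gradient"
Q = rng.standard_normal((32, 4))                                 # warm-started right factor
P, Q = powersgd_step(M, Q)
err = np.linalg.norm(M - P @ Q.T) / np.linalg.norm(M)            # tiny: rank 4 fits exactly
```

Instead of communicating the full 64x32 gradient, workers would exchange only the 64x4 and 32x4 factors, which is where the bandwidth savings come from.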
MARD1NO/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
MARD1NO/tensorflow-internals
An open-source ebook about the TensorFlow kernel and its implementation mechanisms.