Pinned Repositories
Awesome-GPU
Awesome resources for GPUs
cmake-examples
Useful CMake Examples
CUDA-PPT
cute-gemm
LearnDLSysCourse
Learning_CUDA
MadMario-OneFlow
oneflow
OneFlow is a performance-centered and open-source deep learning framework.
paper_reading
Tools
Collect some useful code.
MARD1NO's Repositories
MARD1NO/Learning_CUDA
MARD1NO/paper_reading
MARD1NO/LearningTemplates
MARD1NO/Note
MARD1NO/NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
MARD1NO/AI-System
Education resources for the System for AI course.
MARD1NO/baidu-allreduce
MARD1NO/ConcurrencyInAction
MARD1NO/Cpp-Concurrency-in-Action-2ed
C++11/14/17/20 multithreading, covering operating-system principles and concurrent programming techniques.
MARD1NO/cuda-samples
Samples for CUDA developers demonstrating features in the CUDA Toolkit.
MARD1NO/cuda-training-series
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
MARD1NO/data
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
MARD1NO/DATASTRUCT
MARD1NO/DeepRec
DeepRec is a recommendation engine based on TensorFlow.
MARD1NO/DesignPattern
A full implementation of the 23 design patterns in C++11.
MARD1NO/ECE408
MARD1NO/fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch providing a process group for distributed training that outperforms NCCL at large scale.
MARD1NO/FBTT-Embedding
This is a Tensor-Train-based compression library for the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. It can reduce total model size by up to 100x in Facebook's open-sourced DLRM model while achieving the same model quality, and the implementation is faster than state-of-the-art alternatives. Existing state-of-the-art libraries decompress whole embedding tables on the fly, so they provide no memory reduction during training. This library decompresses only the requested rows, which can reduce the memory footprint per embedding table by up to 10,000x. It also includes a software cache that keeps a portion of the table entries in decompressed form for faster lookup.
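The row-wise decompression idea above can be sketched in NumPy: the embedding table exists only as tensor-train cores, and a single row is reconstructed on demand without ever materializing the full table. This is a toy under assumed shapes (a 3-core split, names like `decompress_row`), not the library's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table with 10*10*10 = 1000 rows and 2*3*4 = 24 embedding dims,
# stored only as three tensor-train (TT) cores -- the full table is never kept.
row_dims, emb_dims, ranks = (10, 10, 10), (2, 3, 4), (1, 8, 8, 1)
cores = [rng.standard_normal((ranks[k], row_dims[k], emb_dims[k], ranks[k + 1]))
         for k in range(3)]

def decompress_row(cores, idx):
    """Reconstruct a single embedding row by contracting one slice per core."""
    i = np.unravel_index(idx, row_dims)   # split flat row index into (i0, i1, i2)
    out = np.ones((1, 1))                 # running result, shape (emb_so_far, rank)
    for k, core in enumerate(cores):
        G = core[:, i[k]]                 # slice for this sub-index: (r_k, m_k, r_{k+1})
        out = np.einsum('er,rms->ems', out, G).reshape(-1, G.shape[-1])
    return out.ravel()                    # final embedding row, shape (24,)

row = decompress_row(cores, 123)          # only this one row is ever materialized
```

Storage here is 3 small cores instead of a 1000x24 dense matrix; the per-row cost is a few tiny matrix contractions, which is what makes a decompressed-row software cache worthwhile.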
MARD1NO/flash-attention
MARD1NO/Learning_compile
MARD1NO/LearnRust
MARD1NO/MARD1NO
MARD1NO/MetaNN
MARD1NO/MIT6.S081
MARD1NO/openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
MARD1NO/powersgd
Practical low-rank gradient compression for distributed optimization: https://arxiv.org/abs/1905.13727
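The low-rank compression in the linked paper can be sketched in NumPy as one step of subspace iteration on a gradient matrix. This single-process toy omits the all-reduce of the factors and the error-feedback buffer the real algorithm uses; the function name `powersgd_step` is illustrative, not the repo's API.

```python
import numpy as np

def powersgd_step(M, Q):
    """One PowerSGD-style step: compress gradient M into low-rank factors (P, Q)."""
    P = M @ Q                    # project the gradient onto the current subspace
    P, _ = np.linalg.qr(P)       # orthogonalize (one step of power iteration)
    Q = M.T @ P                  # update the right factor; workers would all-reduce P and Q
    return P, Q                  # the decompressed gradient is P @ Q.T

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))  # an exactly rank-4 "gradient"
Q = rng.standard_normal((32, 4))                                 # warm-started right factor
P, Q = powersgd_step(M, Q)
err = np.linalg.norm(M - P @ Q.T) / np.linalg.norm(M)            # tiny: rank 4 fits exactly
```

Instead of communicating the full 64x32 gradient, workers would exchange only the 64x4 and 32x4 factors, which is where the bandwidth savings come from.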
MARD1NO/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
MARD1NO/tensorflow-internals
An open-source ebook about the TensorFlow kernel and its implementation mechanisms.