lessw2020
AI/PyTorch Partner Engineer, Meta AI (Facebook AI); Principal Software Engineer, Audere; Software Architect, X10 Wireless; Dev/PM, Microsoft
Seattle, WA USA
Pinned Repositories
Best-Deep-Learning-Optimizers
Collection of the latest and greatest deep learning optimizers for PyTorch, suitable for CNN and NLP models
mish
Mish Deep Learning Activation Function for PyTorch / FastAI
mrnet-fastai
Deep Learning CNN using FastAI for the Stanford MRNet Knee MRI diagnosis challenge
Ranger-Deep-Learning-Optimizer
Ranger - a synergistic optimizer combining RAdam (Rectified Adam), Gradient Centralization, and LookAhead in one codebase (an illustrative sketch follows this list)
Ranger-Mish-ImageWoof-5
Repo to build on and reproduce the record-breaking Ranger-Mish-SelfAttention setup on the FastAI ImageWoof dataset (5 epochs)
Ranger21
A rewrite of the Ranger deep learning optimizer using the newest components
Ranger22
Testing various improvements to Ranger21 for 2022
res2net-plus
Res2Net architecture with improved stem and Mish activation function
training-detr
Unofficial Colab showing how to train DETR, the transformer-based object detector, with your own dataset. DETR = DEtection TRansformer
transformer_central
Various transformers for FSDP research
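As a rough illustration of how the pieces named in the Ranger entry above fit together, the sketch below composes RAdam, gradient centralization, and a LookAhead wrapper in plain PyTorch. It is not the repository's implementation; the Lookahead class, the centralize_gradients helper, and the use of torch.optim.RAdam (available in torch >= 1.10) are assumptions made here for brevity.

```python
# Illustrative only: a minimal composition of the three ingredients named in the
# Ranger description (RAdam + Gradient Centralization + LookAhead), not the
# repository's actual code. Assumes torch >= 1.10 for torch.optim.RAdam.
import torch
from torch.optim import RAdam


def centralize_gradients(params):
    # Gradient centralization: remove the per-filter mean from each
    # multi-dimensional weight gradient before the optimizer step.
    for p in params:
        if p.grad is not None and p.grad.dim() > 1:
            dims = tuple(range(1, p.grad.dim()))
            p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))


class Lookahead:
    # LookAhead: every k fast steps, pull a slow copy of the weights a fraction
    # alpha toward the fast weights, then reset the fast weights to the slow copy.
    def __init__(self, base_optimizer, k=6, alpha=0.5):
        self.base, self.k, self.alpha, self.steps = base_optimizer, k, alpha, 0
        self.fast = [p for g in base_optimizer.param_groups for p in g["params"]]
        self.slow = [p.detach().clone() for p in self.fast]

    def zero_grad(self):
        self.base.zero_grad()

    def step(self):
        centralize_gradients(self.fast)
        self.base.step()
        self.steps += 1
        if self.steps % self.k == 0:
            for slow_p, fast_p in zip(self.slow, self.fast):
                slow_p.add_(fast_p.detach() - slow_p, alpha=self.alpha)
                fast_p.data.copy_(slow_p)


# Usage sketch: wrap an inner RAdam optimizer, as Ranger does conceptually.
model = torch.nn.Linear(16, 4)
optimizer = Lookahead(RAdam(model.parameters(), lr=1e-3))
loss = model(torch.randn(8, 16)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```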
lessw2020's Repositories
lessw2020/FAdam_PyTorch
An implementation of FAdam (Fisher Adam) in PyTorch
lessw2020/triton_kernels_for_fun_and_profit
Custom kernels in the Triton language for accelerating LLMs (a generic Triton kernel sketch follows the repository list)
lessw2020/cuda-kernel-dev
In-progress CUDA kernels
lessw2020/tau_graph
Pipeline Parallelism for PyTorch
lessw2020/actnn
ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
lessw2020/apex_nvidia
A PyTorch extension: tools for easy mixed-precision and distributed training in PyTorch
lessw2020/asynch-checkpointing
Asynchronous checkpointing for PyTorch
lessw2020/cfx-research
lessw2020/ColossalAI
Making large AI models cheaper, faster and more accessible
lessw2020/cutlass_local
CUDA Templates for Linear Algebra Subroutines
lessw2020/dietgpu
GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.
lessw2020/float8_experimental
Experimental PyTorch-native float8 training UX
lessw2020/fp6_llm
Efficient GPU support for LLM inference with 6-bit quantization (FP6)
lessw2020/general_utils
lessw2020/gitlfs
lessw2020/largefiles
lessw2020/marlin-kernel
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
lessw2020/MatmulPingPong
An easy-to-understand TensorOp matmul tutorial
lessw2020/megalodon
Reference implementation of the Megalodon 7B model
lessw2020/nvcomp
Repository for nvCOMP docs and examples. nvCOMP is a library for fast lossless compression/decompression on the GPU that can be downloaded from https://developer.nvidia.com/nvcomp.
lessw2020/NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
lessw2020/pingpong
Integrating the ping-pong kernel into PyTorch
lessw2020/pytorch_fork
(forked) Tensors and Dynamic neural networks in Python with strong GPU acceleration
lessw2020/spacebyte
A byte-level decoder architecture that matches the performance of tokenized Transformers.
lessw2020/SpeeD
SpeeD: A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
lessw2020/tf32_gemm
Example of binding a TF32 CUTLASS GEMM kernel to PyTorch
lessw2020/torchtitan_oss
A native PyTorch Library for large model training
lessw2020/torchtune
A native PyTorch library for LLM fine-tuning
lessw2020/UVM_Tensor
Experimental: CUDA Unified Virtual Memory-based tensors for PyTorch
lessw2020/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
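For context on the triton_kernels_for_fun_and_profit entry above, here is the standard introductory pattern for a custom Triton kernel: a block-wise, masked vector add launched from PyTorch. It is a generic example, not code taken from that repository; the kernel and helper names are illustrative.

```python
# Generic Triton example (requires a CUDA GPU, torch, and triton installed).
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE chunk of the flat vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```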