Pinned Repositories
AM207
ao
torchao: PyTorch Architecture Optimization (AO). A repository to host AO techniques and performant kernels that work with PyTorch.
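For flavor, torchao's in-place quantization entry point looks roughly like this (a minimal sketch; names such as `int8_weight_only` have shifted across releases, so treat it as illustrative):

```python
# Sketch of torchao weight-only quantization (API details vary by version).
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda()
# Swap the Linear weights for int8 weight-only quantized versions, in place.
quantize_(model, int8_weight_only())
out = model(torch.randn(8, 1024, device="cuda"))
```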
cme213_material_2013
CME 213 Class Material
cryptocurrency-derivatives-pricing-and-delta-neutral-volatility-trading
This project downloads and analyzes cryptocurrency option data available on Deribit via its public API. Data are collected on a remote Ubuntu server using Python 3, shell scripts, and SQLite, and are then analyzed locally with Python 3.
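The collection step might look like the sketch below: pull the live BTC option instruments from Deribit's public `get_instruments` endpoint and store them in SQLite (illustrative only, not the repo's actual scripts):

```python
# Fetch BTC option instruments from Deribit's public API into SQLite.
import json
import sqlite3
import urllib.request

URL = ("https://www.deribit.com/api/v2/public/get_instruments"
       "?currency=BTC&kind=option&expired=false")

with urllib.request.urlopen(URL) as resp:
    instruments = json.load(resp)["result"]

con = sqlite3.connect("deribit_options.db")
con.execute("CREATE TABLE IF NOT EXISTS instruments "
            "(name TEXT PRIMARY KEY, strike REAL, expiry_ms INTEGER, kind TEXT)")
con.executemany(
    "INSERT OR REPLACE INTO instruments VALUES (?, ?, ?, ?)",
    [(i["instrument_name"], i["strike"], i["expiration_timestamp"], i["option_type"])
     for i in instruments],
)
con.commit()
con.close()
```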
DL_packt
intro_to_simpy
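No description is given, but the name points at SimPy, the discrete-event simulation library; the canonical clock example from SimPy's documentation gives the flavor of what such an intro covers:

```python
# Two concurrent SimPy processes ticking at different rates.
import simpy

def clock(env, name, tick):
    while True:
        print(name, "at", env.now)
        yield env.timeout(tick)   # suspend this process for `tick` sim-time units

env = simpy.Environment()
env.process(clock(env, "fast", 0.5))
env.process(clock(env, "slow", 1.0))
env.run(until=2)
```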
Python-Financial-Tools
Providing financial analysis tools to the Python open-source community.
torchtune
A Native-PyTorch Library for LLM Fine-tuning
triton-rs
unsloth-notebooks
Unsloth Fine-tuning Notebooks for Google Colab, Kaggle, Hugging Face and more.
jeromeku's Repositories
jeromeku/ao
torchao: PyTorch Architecture Optimization (AO). A repository to host AO techniques and performant kernels that work with PyTorch.
jeromeku/cuda-python
CUDA Python: Performance meets Productivity
jeromeku/cutlass
CUDA Templates for Linear Algebra Subroutines
jeromeku/DeepEP
DeepEP: an efficient expert-parallel communication library
jeromeku/flashinfer-bench
Building the Virtuous Cycle for AI-driven LLM Systems
jeromeku/gpu-experiments
A collection of GPU tests and benchmarks for my own research.
jeromeku/helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
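A minimal sketch of the Helion style, close to the project's own add-kernel example (decorator and `hl.tile` names assumed from the README; details may differ):

```python
# A Helion kernel reads like PyTorch; tiling/autotuning happen under the hood.
import torch
import helion
import helion.language as hl

@helion.kernel()
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):   # tile sizes are chosen by the autotuner
        out[tile] = x[tile] + y[tile]
    return out

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
print(add(x, y))
```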
jeromeku/llvm-project
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
jeromeku/llvm-tutor
A collection of out-of-tree LLVM passes for teaching and learning
jeromeku/Megatron-Bridge
Training library for Megatron-based models
jeromeku/Megatron-LM
Ongoing research training transformer models at scale
jeromeku/memory-layout
This repository contains small experiments to visualize how C++ objects are laid out in memory, with a focus on vptr, vtable, object members, and memory padding. It accompanies a series of blog posts exploring these concepts in depth.
jeromeku/mlir-list
MLIR project for the "Define and lower your dialect" session, at the MLIR Summer School 2025
jeromeku/mlir-tutorial-brz
SBLP 2025 MLIR Tutorial
jeromeku/modular
The Modular Platform (includes MAX & Mojo)
jeromeku/nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform one-sided communication from within CUDA kernels and on CUDA streams.
jeromeku/OpenEnv
An interface library for RL post-training with environments.
jeromeku/perf-ninja
This is an online course where you can learn and master the skill of low-level performance analysis and tuning.
jeromeku/pt-autopar
An experimental implementation of compiler-driven automatic sharding of models across a given device mesh.
jeromeku/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
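The "dynamic" part in a few lines: the autograd graph is built as the code runs, on CPU or GPU tensors alike.

```python
# Define-by-run autograd: the graph is constructed while this code executes.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, requires_grad=True, device=device)
y = (x ** 2).sum()   # graph built here
y.backward()
print(x.grad)        # dy/dx = 2x
```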
jeromeku/quack
A Quirky Assortment of CuTe Kernels
jeromeku/sglang
SGLang is a fast serving framework for large language models and vision language models.
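A sketch of offline use, assuming SGLang's `sgl.Engine` API (argument and return shapes are from memory and may differ by version):

```python
# Offline batch generation with SGLang's engine (illustrative).
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
prompts = ["The capital of France is"]
outputs = llm.generate(prompts, {"temperature": 0.0, "max_new_tokens": 16})
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])
llm.shutdown()
```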
jeromeku/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
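The high-level LLM API looks roughly like this (names per recent releases; treat as a sketch rather than canonical usage):

```python
# TensorRT-LLM's high-level LLM API: build/load an engine and generate.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=32, temperature=0.0)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```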
jeromeku/TensorRT-Model-Optimizer
nvidia-modelopt is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
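Post-training quantization with ModelOpt is roughly a one-call affair (a sketch assuming the `mtq.quantize` entry point and `INT8_DEFAULT_CFG` config; consult the library for exact names):

```python
# ModelOpt PTQ sketch: quantize a toy model after a short calibration pass.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).cuda()
calib_data = [torch.randn(8, 64, device="cuda") for _ in range(4)]

def forward_loop(m):
    # Calibration: run representative batches so quantizers collect ranges.
    for batch in calib_data:
        m(batch)

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```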
jeromeku/torchcomms
torchcomms: a modern PyTorch communications API
jeromeku/torchforge
PyTorch-native post-training at scale
jeromeku/torchtitan
A PyTorch native library for large model training
jeromeku/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
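A minimal FP8 sketch in the spirit of the library's quickstart (recipe knobs are illustrative):

```python
# FP8 forward/backward through a Transformer Engine layer.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.Linear(768, 768).cuda()
x = torch.randn(16, 768, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```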
jeromeku/triton
Development repository for the Triton language and compiler
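The canonical Triton starter, a masked vector-add kernel:

```python
# Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```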
jeromeku/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
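vLLM's offline generation API in a few lines:

```python
# Offline batched generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=32)
for out in llm.generate(["The future of inference is"], params):
    print(out.outputs[0].text)
```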