Pinned Repositories
AM207
ao
torchao: PyTorch Architecture Optimization (AO). A repository to host AO techniques and performant kernels that work with PyTorch.
cme213_material_2013
CME 213 Class Material
cryptocurrency-derivatives-pricing-and-delta-neutral-volatility-trading
This project downloads and analyzes cryptocurrency option data available on Deribit via a public API. Data are collected on a remote Ubuntu server using Python 3, Shell, and SQLite, and are then analyzed locally with Python 3.
DL_packt
intro_to_simpy
Python-Financial-Tools
Providing financial analysis tools to the Python open-source community.
torchtune
A Native-PyTorch Library for LLM Fine-tuning
triton-rs
unsloth-notebooks
Unsloth Fine-tuning Notebooks for Google Colab, Kaggle, Hugging Face and more.
jeromeku's Repositories
jeromeku/cuda-samples
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
jeromeku/CUDALibrarySamples
CUDA Library Samples
jeromeku/cutlass
CUDA Templates for Linear Algebra Subroutines
jeromeku/DeepEP
DeepEP: an efficient expert-parallel communication library
jeromeku/gemm-cublas
jeromeku/gpu-experiments
A collection of GPU tests and benchmarks for my own research.
jeromeku/hilt
jeromeku/kineto
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
jeromeku/llvm-project
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
jeromeku/Megakernels
kernels, of the mega variety
jeromeku/Megatron-LM
Ongoing research training transformer models at scale
jeromeku/mlir-list
MLIR project for the "Define and lower your dialect" session, at the MLIR Summer School 2025
jeromeku/modular
The Modular Platform (includes MAX & Mojo)
jeromeku/nvbench
CUDA Kernel Benchmarking Library
jeromeku/nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform one-sided communication from within CUDA kernels and on CUDA streams.
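A minimal sketch of the one-sided pattern described above, assuming a standard NVSHMEM installation with one GPU per PE; the kernel name, buffer sizes, and launch configuration are illustrative and not taken from this repository:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its buffer directly into the next PE's symmetric memory
// from inside a CUDA kernel (one-sided put); the target PE issues no receive.
__global__ void ring_put(int *dst, const int *src, size_t n) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_put(dst, src, n, peer);  // one-sided communication from device code
}

int main() {
    nvshmem_init();
    // Assumption: one GPU per PE on each node.
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    const size_t n = 16;
    // Symmetric allocations are remotely addressable by other PEs.
    int *src = (int *)nvshmem_malloc(n * sizeof(int));
    int *dst = (int *)nvshmem_malloc(n * sizeof(int));
    // ... initialize src on the device ...

    ring_put<<<1, 1>>>(dst, src, n);
    cudaDeviceSynchronize();     // wait for the local kernel to finish
    nvshmem_barrier_all();       // ensure remote puts are visible on all PEs

    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```

Compiled with nvcc against the NVSHMEM libraries and launched with one process per GPU (e.g. via the NVSHMEM-provided launcher), each PE deposits data into its neighbor without any matching receive call, which is the coordination saving the description refers to.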
jeromeku/nvshmem-nccl
This is a set of simple programs that can be used to explore the features of a parallel platform.
jeromeku/playing-with-mlir
jeromeku/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
jeromeku/quack
A Quirky Assortment of CuTe Kernels
jeromeku/qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
jeromeku/sglang
SGLang is a fast serving framework for large language models and vision language models.
jeromeku/slime
slime is an LLM post-training framework aiming for RL scaling.
jeromeku/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate inference execution in a performant way.
jeromeku/ThunderKittens
Tile primitives for speedy kernels
jeromeku/torch_vmm_alloc
Allow torch tensor memory to be released and resumed later
jeromeku/torchtitan
A PyTorch native library for large model training
jeromeku/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
jeromeku/triton
Development repository for the Triton language and compiler
jeromeku/tritonparse
TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer(WIP) for Triton Kernels
jeromeku/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs