Pinned Repositories
AM207
ao
torchao: PyTorch Architecture Optimization (AO). A repository to host AO techniques and performant kernels that work with PyTorch.
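For flavor, torchao's in-place quantization entry point looks roughly like this (a minimal sketch; names such as `int8_weight_only` have shifted across releases, so treat it as illustrative):

```python
# Sketch of torchao weight-only quantization (API details vary by version).
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda()
# Swap the Linear weights for int8 weight-only quantized versions, in place.
quantize_(model, int8_weight_only())
out = model(torch.randn(8, 1024, device="cuda"))
```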
cme213_material_2013
CME 213 Class Material
cryptocurrency-derivatives-pricing-and-delta-neutral-volatility-trading
This project downloads and analyzes cryptocurrency option data available on Deribit via its public API. Data are collected on a remote Ubuntu server using Python 3, shell scripts, and SQLite, and are then analyzed locally with Python 3.
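The collection step might look like the sketch below: pull the live BTC option instruments from Deribit's public `get_instruments` endpoint and store them in SQLite (illustrative only, not the repo's actual scripts):

```python
# Fetch BTC option instruments from Deribit's public API into SQLite.
import json
import sqlite3
import urllib.request

URL = ("https://www.deribit.com/api/v2/public/get_instruments"
       "?currency=BTC&kind=option&expired=false")

with urllib.request.urlopen(URL) as resp:
    instruments = json.load(resp)["result"]

con = sqlite3.connect("deribit_options.db")
con.execute("CREATE TABLE IF NOT EXISTS instruments "
            "(name TEXT PRIMARY KEY, strike REAL, expiry_ms INTEGER, kind TEXT)")
con.executemany(
    "INSERT OR REPLACE INTO instruments VALUES (?, ?, ?, ?)",
    [(i["instrument_name"], i["strike"], i["expiration_timestamp"], i["option_type"])
     for i in instruments],
)
con.commit()
con.close()
```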
DL_packt
intro_to_simpy
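No description is given, but the name points at SimPy, the discrete-event simulation library; the canonical clock example from SimPy's documentation gives the flavor of what such an intro covers:

```python
# Two concurrent SimPy processes ticking at different rates.
import simpy

def clock(env, name, tick):
    while True:
        print(name, "at", env.now)
        yield env.timeout(tick)   # suspend this process for `tick` sim-time units

env = simpy.Environment()
env.process(clock(env, "fast", 0.5))
env.process(clock(env, "slow", 1.0))
env.run(until=2)
```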
Python-Financial-Tools
Providing financial analysis tools to the Python open-source community.
torchtune
A Native-PyTorch Library for LLM Fine-tuning
triton-rs
unsloth-notebooks
Unsloth Fine-tuning Notebooks for Google Colab, Kaggle, Hugging Face and more.
jeromeku's Repositories
jeromeku/ao
torchao: PyTorch Architecture Optimization (AO). A repository to host AO techniques and performant kernels that work with PyTorch.
jeromeku/cuda-python
CUDA Python: Performance meets Productivity
jeromeku/cutlass
CUDA Templates for Linear Algebra Subroutines
jeromeku/DeepEP
DeepEP: an efficient expert-parallel communication library
jeromeku/flashinfer-bench
Building the Virtuous Cycle for AI-driven LLM Systems
jeromeku/gpu-experiments
A collection of GPU tests and benchmarks for my own research.
jeromeku/helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
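A minimal sketch of the Helion style, close to the project's own add-kernel example (decorator and `hl.tile` names assumed from the README; details may differ):

```python
# A Helion kernel reads like PyTorch; tiling/autotuning happen under the hood.
import torch
import helion
import helion.language as hl

@helion.kernel()
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):   # tile sizes are chosen by the autotuner
        out[tile] = x[tile] + y[tile]
    return out

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
print(add(x, y))
```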
jeromeku/llvm-project
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
jeromeku/llvm-tutor
A collection of out-of-tree LLVM passes for teaching and learning
jeromeku/Megatron-Bridge
Training library for Megatron-based models
jeromeku/Megatron-LM
Ongoing research training transformer models at scale
jeromeku/memory-layout
This repository contains small experiments to visualize how C++ objects are laid out in memory, with a focus on vptr, vtable, object members, and memory padding. It accompanies a series of blog posts exploring these concepts in depth.
jeromeku/mlir-list
MLIR project for the "Define and lower your dialect" session, at the MLIR Summer School 2025
jeromeku/mlir-tutorial-brz
SBLP 2025 MLIR Tutorial
jeromeku/modular
The Modular Platform (includes MAX & Mojo)
jeromeku/nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform one-sided communication from within CUDA kernels and on CUDA streams.
jeromeku/OpenEnv
An interface library for RL post-training with environments.
jeromeku/perf-ninja
This is an online course where you can learn and master the skill of low-level performance analysis and tuning.
jeromeku/pt-autopar
An experimental implementation of compiler-driven automatic sharding of models across a given device mesh.
jeromeku/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
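The "dynamic" part in a few lines: the autograd graph is built as the code runs, on CPU or GPU tensors alike.

```python
# Define-by-run autograd: the graph is constructed while this code executes.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, requires_grad=True, device=device)
y = (x ** 2).sum()   # graph built here
y.backward()
print(x.grad)        # dy/dx = 2x
```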
jeromeku/quack
A Quirky Assortment of CuTe Kernels
jeromeku/sglang
SGLang is a fast serving framework for large language models and vision language models.
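A sketch of offline use, assuming SGLang's `sgl.Engine` API (argument and return shapes are from memory and may differ by version):

```python
# Offline batch generation with SGLang's engine (illustrative).
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
prompts = ["The capital of France is"]
outputs = llm.generate(prompts, {"temperature": 0.0, "max_new_tokens": 16})
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])
llm.shutdown()
```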
jeromeku/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
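The high-level LLM API looks roughly like this (names per recent releases; treat as a sketch rather than canonical usage):

```python
# TensorRT-LLM's high-level LLM API: build/load an engine and generate.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=32, temperature=0.0)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```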
jeromeku/TensorRT-Model-Optimizer
nvidia-modelopt is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
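Post-training quantization with ModelOpt is roughly a one-call affair (a sketch assuming the `mtq.quantize` entry point and `INT8_DEFAULT_CFG` config; consult the library for exact names):

```python
# ModelOpt PTQ sketch: quantize a toy model after a short calibration pass.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).cuda()
calib_data = [torch.randn(8, 64, device="cuda") for _ in range(4)]

def forward_loop(m):
    # Calibration: run representative batches so quantizers collect ranges.
    for batch in calib_data:
        m(batch)

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```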
jeromeku/torchcomms
torchcomms: a modern PyTorch communications API
jeromeku/torchforge
PyTorch-native post-training at scale
jeromeku/torchtitan
A PyTorch native library for large model training
jeromeku/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
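A minimal FP8 sketch in the spirit of the library's quickstart (recipe knobs are illustrative):

```python
# FP8 forward/backward through a Transformer Engine layer.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.Linear(768, 768).cuda()
x = torch.randn(16, 768, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```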
jeromeku/triton
Development repository for the Triton language and compiler
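The canonical Triton starter, a masked vector-add kernel:

```python
# Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```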
jeromeku/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
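vLLM's offline generation API in a few lines:

```python
# Offline batched generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=32)
for out in llm.generate(["The future of inference is"], params):
    print(out.outputs[0].text)
```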