| 🏠 Project Page | 📄 Paper | 🔥 Demo
In this paper, we introduce CUDA-L1, an automated reinforcement learning (RL) framework for CUDA optimization. The core of CUDA-L1 is a contrastive RL model, a newly designed RL system that improves optimization through comparative learning. CUDA-L1 achieves unprecedented performance on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of 17.7× across all 250 CUDA kernels of KernelBench, with peak speedups reaching 449×. The model also demonstrates excellent portability across GPU architectures, achieving average speedups of 17.8× on H100, 19.0× on RTX 3090, 16.5× on L40, 14.7× on H800, and 13.9× on H20, despite being optimized specifically for the A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties:
- It discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance;
- It uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations (e.g., a 2× gain from memory-access optimization combined with a 3× gain from kernel fusion compounds to roughly 6×) and how certain "gatekeeper" techniques must be applied first to unlock the effectiveness of others;
- It identifies non-obvious performance bottlenecks (such as CPU-GPU synchronization dominating compute optimizations) and rejects seemingly beneficial optimizations that actually harm performance (see the sketch after this list).
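To make the synchronization point concrete, here is a minimal PyTorch sketch (our illustration, not code discovered by CUDA-L1) of the pitfall: calling `.item()` inside a loop forces a CPU-GPU synchronization on every iteration, so removing those syncs can matter more than optimizing the kernel itself.

```python
import torch

x = torch.randn(1024, 1024, device="cuda")

# Slow pattern: .item() blocks until the GPU finishes, forcing one
# CPU-GPU synchronization per loop iteration.
total = 0.0
for _ in range(100):
    total += (x @ x).sum().item()

# Faster pattern: accumulate on the GPU and synchronize once at the end.
acc = torch.zeros((), device="cuda")
for _ in range(100):
    acc += (x @ x).sum()  # stays asynchronous on the GPU
total = acc.item()        # single synchronization point
```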
First, we augment the training dataset with CUDA code variants generated by LLMs and fine-tune the base model on the executable, correct implementations to establish foundational CUDA knowledge.
Next, the model iteratively generates CUDA kernels, validates their executability and correctness, and trains on the successfully validated examples, enabling autonomous improvement without human supervision.
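This loop can be summarized with the following schematic sketch; the helper names (`model.generate`, `validate`, `fine_tune`) are hypothetical placeholders, not the paper's actual API:

```python
# Schematic sketch of the self-supervised training loop (our paraphrase;
# all helper names below are hypothetical, not the authors' components).
def self_training_round(model, tasks, validate, fine_tune, n_samples=8):
    accepted = []
    for task in tasks:
        for candidate in model.generate(task, n=n_samples):
            # Keep only kernels that compile, run, and match the reference output.
            if validate(candidate, task):
                accepted.append((task, candidate))
    fine_tune(model, accepted)  # train only on validated examples
    return model
```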
Finally, we employ contrastive learning with execution-time rewards, training the model to distinguish faster from slower CUDA implementations and optimize directly for execution speed.
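As a rough sketch of how an execution-time reward can be computed (our assumption of the mechanics, not the paper's exact formulation): time a candidate kernel against the reference implementation and reward the measured speedup, gated on correctness.

```python
import time
import torch

def execution_reward(candidate_fn, reference_fn, inputs,
                     n_warmup=3, n_runs=10, atol=1e-4):
    """Speedup-based reward: reference_time / candidate_time, 0 if incorrect.

    A minimal sketch under our assumptions, not the authors' exact reward.
    """
    # Correctness gate: an incorrect kernel earns no reward.
    if not torch.allclose(candidate_fn(*inputs), reference_fn(*inputs), atol=atol):
        return 0.0

    def mean_time(fn):
        for _ in range(n_warmup):          # warm-up runs
            fn(*inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            fn(*inputs)
        torch.cuda.synchronize()           # wait for all kernels to finish
        return (time.perf_counter() - start) / n_runs

    return mean_time(reference_fn) / mean_time(candidate_fn)  # >1 means faster
```

A contrastive learner can then be trained on pairs of implementations ranked by this reward, learning to prefer the faster variant.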
| Method | Mean | Max | 75% | 50% | 25% | Success (# out of total) | Speedup (# out of total) |
|---|---|---|---|---|---|---|---|
| All Levels | 17.7× | 449× | 7.08× | 1.81× | 1.22× | 249/250 | 242/250 |
| Level 1 | 12.3× | 166× | 9.28× | 1.65× | 1.15× | 99/100 | 96/100 |
| Level 2 | 6.39× | 111× | 4.42× | 1.61× | 1.24× | 100/100 | 97/100 |
| Level 3 | 50.8× | 449× | 22.9× | 2.66× | 1.58× | 50/50 | 49/50 |
| GPU Device | Mean | Max | 75% | 50% | 25% | Success Rate |
|---|---|---|---|---|---|---|
| A100 PCIe | 17.7× | 449× | 7.08× | 1.81× | 1.22× | 99.6% |
| H100 SXM | 17.8× | 1,001× | 4.02× | 1.63× | 1.16× | 98.4% |
| RTX 3090 | 19.0× | 611× | 4.41× | 1.44× | 1.11× | 98.4% |
| L40 | 16.5× | 365× | 6.17× | 1.61× | 1.15× | 98.8% |
| H800 SXM | 14.7× | 433× | 4.80× | 1.57× | 1.16× | 99.6% |
| H20 | 13.9× | 412× | 4.76× | 1.54× | 1.16× | 99.2% |
- Level 3 tasks (complex ML operations) show the highest speedups, making CUDA-L1 especially valuable for real-world applications.
We also compare CUDA-L1 with baseline methods (built on DeepSeek-R1 and OpenAI o1) on KernelBench.
We provide the CUDA code snippets optimized by CUDA-L1 in the optimized_cuda_code folder, with a separate version for each GPU device. For example, to reproduce our results on H100 SXM, download ./optimized_cuda_code/h100_xsm.json and run each code snippet on your H100 device.
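A minimal loading sketch (the JSON schema here is our assumption; inspect the file and adjust the field access as needed):

```python
import json

# Assumes each entry carries a runnable snippet as a string; this schema is
# a guess, so adapt it to the actual file layout.
with open("optimized_cuda_code/h100_xsm.json") as f:
    entries = json.load(f)

for entry in entries:
    code = entry["code"] if isinstance(entry, dict) else entry  # assumed field
    exec(compile(code, "<cuda-l1-snippet>", "exec"))            # run the snippet
```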
```bibtex
@article{deepreinforce2025cudal1,
  title={CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning},
  author={Li, Xiaoya and Sun, Xiaofei and Wang, Albert and Li, Jiwei and Shum, Chris},
  journal={arXiv preprint arXiv:2507.14111},
  year={2025}
}
```

If you have any questions, please reach out to us at research@deep-reinforce.com.