

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

License     |     🏠  Project Page     |     📄  Paper     |     🔥  Demo


Introduction

In this paper, we introduce CUDA-L1, an automated reinforcement learning (RL) framework for CUDA optimization. At its core is a contrastive RL model, a newly designed RL system that improves optimization through comparative learning. CUDA-L1 achieves unprecedented performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of 17.7× across all 250 CUDA kernels of KernelBench, with peak speedups reaching 449×. The model also demonstrates excellent portability across GPU architectures, achieving average speedups of 17.8× on H100, 19.0× on RTX 3090, 16.5× on L40, 14.7× on H800, and 13.9× on H20, despite being optimized specifically for the A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties:

  • It discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance;
  • It uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations (see the toy example after this list) and the way certain "gatekeeper" techniques must be applied first to unlock the effectiveness of others;
  • It identifies non-obvious performance bottlenecks (such as CPU-GPU synchronization dominating compute optimizations) and rejects seemingly beneficial optimizations that actually harm performance.
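To make the multiplicative point concrete, here is a toy illustration with made-up speedup factors (not measured numbers from the paper):

```python
# Toy illustration with made-up factors: independent optimizations compound
# multiplicatively, so three modest wins yield a large combined speedup.
speedups = {"memory coalescing": 2.0, "kernel fusion": 3.0, "shared-memory tiling": 1.5}
combined = 1.0
for factor in speedups.values():
    combined *= factor
print(combined)  # 9.0x combined, versus 6.5x if the gains merely added up
```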

How CUDA-L1 Works

CUDA-L1 Pipeline

Stage 1: Supervised Learning

We augment the training dataset with CUDA code variants generated by LLMs and fine-tune the base model on executable and correct implementations to establish foundational CUDA knowledge.
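A minimal sketch of what such a filter-then-fine-tune step could look like is below. The helper names (generate_variants, is_correct_and_executable, fine_tune) are illustrative stand-ins and not part of the released code.

```python
# Hypothetical sketch of Stage 1: keep only LLM-generated CUDA variants that
# compile, run, and match the reference output, then fine-tune on them.

def generate_variants(reference_code: str, n: int) -> list[str]:
    """Ask an LLM for n alternative CUDA implementations of the reference kernel."""
    raise NotImplementedError  # e.g., call an LLM API of your choice

def is_correct_and_executable(code: str, test_inputs) -> bool:
    """Compile and run the candidate, then compare its output to the reference."""
    raise NotImplementedError  # e.g., build the kernel and diff outputs against the reference

def build_sft_dataset(tasks, n_variants: int = 8):
    dataset = []
    for task in tasks:
        for variant in generate_variants(task["reference_code"], n_variants):
            if is_correct_and_executable(variant, task["test_inputs"]):
                dataset.append({"prompt": task["description"], "completion": variant})
    return dataset

# fine_tune(base_model, build_sft_dataset(tasks))  # standard supervised fine-tuning step
```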

Stage 2: Self-Supervised Learning

The model iteratively generates CUDA kernels, validates their correctness and executability, and trains on successfully validated examples, enabling autonomous improvement without human supervision.
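A sketch of this loop, under the assumption that the model and validator expose simple generate/validate/train interfaces (names here are illustrative, not the released API):

```python
# Hypothetical sketch of Stage 2: repeat generate -> validate -> train,
# adding only validated kernels back into the training pool.
def self_supervised_training(model, tasks, validate, rounds: int = 5):
    for _ in range(rounds):
        new_examples = []
        for task in tasks:
            candidate = model.generate(task["prompt"])        # propose a CUDA kernel
            if validate(candidate, task):                      # executable and correct?
                new_examples.append({"prompt": task["prompt"],
                                     "completion": candidate})
        model.train_on(new_examples)                           # no human labels involved
    return model
```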

Stage 3: Contrastive Reinforcement Learning

We employ contrastive learning with execution-time rewards, training the model to distinguish between faster and slower CUDA implementations, ultimately optimizing for superior performance.
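The reward signal is measured execution time. A minimal sketch of how a speedup reward could be computed with PyTorch CUDA events is shown below; the exact reward shaping and the contrastive prompt construction (presenting earlier kernels together with their measured speedups) follow the paper, and the function names here are illustrative.

```python
import torch

def measure_ms(fn, inputs, warmup=5, iters=50):
    """Average GPU wall time of fn(*inputs) in milliseconds, using CUDA events."""
    for _ in range(warmup):
        fn(*inputs)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def speedup_reward(candidate, reference, inputs):
    # Reward is the measured speedup over the reference implementation;
    # a candidate that fails to compile or produces wrong results would
    # receive zero reward before reaching this measurement step.
    return measure_ms(reference, inputs) / measure_ms(candidate, inputs)
```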

Evaluation Results

Performance on KernelBench

| Method | Mean | Max | 75% | 50% | 25% | Success (# out of total) | Speedup (# out of total) |
|---|---|---|---|---|---|---|---|
| All Levels | 17.7× | 449× | 7.08× | 1.81× | 1.22× | 249/250 | 242/250 |
| Level 1 | 12.3× | 166× | 9.28× | 1.65× | 1.15× | 99/100 | 96/100 |
| Level 2 | 6.39× | 111× | 4.42× | 1.61× | 1.24× | 100/100 | 97/100 |
| Level 3 | 50.8× | 449× | 22.9× | 2.66× | 1.58× | 50/50 | 49/50 |

Cross-GPU Performance

| GPU Device | Mean | Max | 75% | 50% | 25% | Success Rate |
|---|---|---|---|---|---|---|
| A100 PCIe | 17.7× | 449× | 7.08× | 1.81× | 1.22× | 99.6% |
| H100 SXM | 17.8× | 1,001× | 4.02× | 1.63× | 1.16× | 98.4% |
| RTX 3090 | 19.0× | 611× | 4.41× | 1.44× | 1.11× | 98.4% |
| L40 | 16.5× | 365× | 6.17× | 1.61× | 1.15× | 98.8% |
| H800 SXM | 14.7× | 433× | 4.80× | 1.57× | 1.16× | 99.6% |
| H20 | 13.9× | 412× | 4.76× | 1.54× | 1.16× | 99.2% |
• CUDA-L1 was trained on A100 GPUs but shows excellent transfer to other architectures
• Level 3 tasks (complex ML operations) show the highest speedups, making CUDA-L1 especially valuable for real-world applications

Comparison with Baseline Methods

We also compare CUDA-L1 against baseline methods built on DeepSeek-R1 and OpenAI o1 on KernelBench.


Want to reproduce our results?

We provide the CUDA code snippets optimized by CUDA-L1 in the optimized_cuda_code folder, with a separate file for each GPU device. For example, to reproduce our results on the H100 SXM, download ./optimized_cuda_code/h100_xsm.json and run each code snippet on your H100 device.
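A minimal sketch of unpacking a released file is below. The JSON layout (a mapping from task identifier to a self-contained code snippet) is an assumption on our part; inspect the file first and run the snippets in an isolated environment.

```python
import json
import pathlib

# Assumed layout: {task_id: code_string, ...}; verify against the actual file.
snippets = json.loads(pathlib.Path("optimized_cuda_code/h100_xsm.json").read_text())

out_dir = pathlib.Path("h100_snippets")
out_dir.mkdir(exist_ok=True)
for task_id, code in snippets.items():
    # Write each optimized kernel to its own file, then benchmark it on an H100.
    (out_dir / f"{task_id}.py").write_text(code)
```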

Citation

@article{deepreinforce2025cudal1,
  title={CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning},
  author={Li, Xiaoya and Sun, Xiaofei and Wang, Albert and Li, Jiwei and Shum, Chris},
  journal={arXiv preprint arXiv:2507.14111},
  year={2025}
}

Contact

If you have any questions, please reach out to us at research@deep-reinforce.com.