Is CUDAGraph generation available during PPO training time?

Question

Is CUDAGraph generation available during PPO training time?

Thanks for the great work! I saw that you support CUDA graph generation for fast inference. Is it also effective during PPO training? PPO generation is really costly. If not, is there any other fast generation method used in RealHF for training?

Answer 1 · 2024-09-13T02:55:19.000Z

Yes. We didn't perform an ablation, but it can lead to about a 4x acceleration in our use cases (2048 prompt + 2048 gen).

The results in our paper were obtained without CUDA graph. We will update the paper soon.

Answer 2 · 2024-09-13T03:03:32.000Z

That's great! Is CUDA graph applied automatically now, or do I need to enable it with certain args?

Answer 3 · 2024-09-13T03:07:48.000Z

Try setting ppo.gen.use_cuda_graph=True, as in this example. You may also want to set ppo.gen.force_no_logits_mask=True.

Please see the doc for detailed explanations.
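
For readers who want to see the two settings side by side, here is a minimal sketch. The option names are the ones mentioned above; the helper `to_overrides` is hypothetical, and passing the options as dotted key=value overrides is an assumption about how an experiment is launched, so check the doc for the actual invocation.

```python
# Option names are taken from the answer above. Rendering them as dotted
# key=value overrides is an assumption, not a confirmed launcher interface.
GEN_OPTIONS = {
    "ppo.gen.use_cuda_graph": True,
    "ppo.gen.force_no_logits_mask": True,
}


def to_overrides(options):
    """Render a flat dict of dotted option names as key=value strings."""
    return [f"{key}={value}" for key, value in options.items()]


overrides = to_overrides(GEN_OPTIONS)
# Join the overrides so they can be appended to whatever launch command you use.
print(" ".join(overrides))
```

Keeping the options in one dict makes it easy to turn CUDA graph off again when comparing generation speed with and without it.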

Answer 4 · 2024-09-13T03:09:24.000Z

You should also use a Docker image version that matches your hardware. We've tested 24.03 for A100 and 24.07 for H100. Other combinations are not guaranteed to work well, thanks to Nvidia.

Answer 5 · 2024-09-13T09:51:39.000Z

Thanks for all the helpful information!