openpsi-project/ReaLHF

Is CUDAGraph generation available during PPO training time?

Closed this issue · 5 comments

DZ9 commented

Thanks for the great work! I saw that you support CUDA graph generation for fast inference. Is it also effective during PPO training? Generation is the really costly part of PPO. Or is there another fast generation method used in ReaLHF for training?

Yes. We didn't perform an ablation, but it can give about a 4x speedup in our use cases (2048 prompt + 2048 generated tokens).

The results in our paper were obtained without CUDA graph. We will update the paper soon.

DZ9 commented

That's great! Is CUDA graph applied automatically now, or do I need to enable it with certain arguments?

Try setting ppo.gen.use_cuda_graph=True as in this example. You may also want to set ppo.gen.force_no_logits_mask=True.
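As a rough illustration of how these two options could be passed as key=value overrides, here is a minimal Python sketch. Only the two option names come from this thread. The helper build_overrides is hypothetical, and the launcher module path in the command is an assumption that should be checked against the ReaLHF documentation.

```python
import shlex


def build_overrides(options):
    """Turn a flat {dotted.key: value} mapping into key=value override strings."""
    return [f"{key}={value}" for key, value in options.items()]


# The two generation options named in the reply above.
gen_options = {
    "ppo.gen.use_cuda_graph": True,
    "ppo.gen.force_no_logits_mask": True,
}

overrides = build_overrides(gen_options)

# ASSUMPTION: the launcher module path below is illustrative only;
# consult the ReaLHF docs for the actual entry point and required arguments.
command = ["python3", "-m", "realhf.apps.quickstart", "ppo", *overrides]

# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

Keeping the options in a dictionary makes it easy to toggle CUDA graph on and off when comparing generation speed between runs.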

Please see the doc for detailed explanations.

You should also use a Docker image version that matches your hardware. We've tested 24.03 for A100 and 24.07 for H100. Other combinations are not guaranteed to work well, thanks to Nvidia.

DZ9 commented

Thanks for all the helpful information!