deepseek-ai/DeepSeek-Coder-V2

reinforcement learning improvement

Ski-ing opened this issue · 2 comments

How large is the improvement in code generation metrics that can be attributed to Group Relative Policy Optimization (GRPO) in the reinforcement learning component?

The effect of GRPO varies by test set. On code generation test sets, GRPO generally yields an improvement of approximately 0.5 points. The gains on math-related benchmarks are more substantial.
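For readers unfamiliar with the method, here is a minimal sketch of the group-relative advantage that gives GRPO its name: several completions are sampled for one prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. This is not code from this repository. The function name, the `eps` term, the use of the population standard deviation, and the sample rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its own sampling group.

    `rewards` holds one scalar reward per completion sampled for the same
    prompt. `eps` is an assumed small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt,
# e.g. 1.0 if the generated code passes its tests and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to form the baseline.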

Thanks for your reply.