mohhao opened this issue 7 months ago · 0 comments
Step-DPO has incomplete outputs, and this may influence the output of the SFT model.