Official implementation of Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Hunyuan, Tencent & UnifiedReward Team

🔥 News

Please leave us a star ⭐ if you find this work helpful.

[Figure: Pref-GRPO pipeline]

🔧 Environment Setup

  1. Clone this repository and navigate to the folder:
git clone https://github.com/CodeGoat24/UnifiedReward.git
cd UnifiedReward/Pref-GRPO
  2. Install the training package:
conda create -n PrefGRPO python=3.12
conda activate PrefGRPO

bash env_setup.sh fastvideo

cd open_clip
pip install -e .
cd ..
  3. Download the models:
huggingface-cli download CodeGoat24/UnifiedReward-qwen-7b
huggingface-cli download CodeGoat24/UnifiedReward-Think-qwen-7b

wget https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378/resolve/main/open_clip_pytorch_model.bin

💻 Training

1. Deploy vLLM server

  1. Install vLLM
pip install vllm==0.9.0.1 transformers==4.52.4
  2. Start the server
bash vllm_utils/vllm_server_UnifiedReward_Think.sh  
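
Before launching training, it can help to confirm that the reward server is reachable. The sketch below assumes the script starts vLLM's OpenAI-compatible server, which exposes `GET /v1/models`. The base URL `http://localhost:8000` is vLLM's default and is only an assumption here; use whatever host and port the script actually configures. `list_served_models` is a hypothetical helper, not part of this repository.

```python
import json
import urllib.request
from typing import List


def list_served_models(base_url: str = "http://localhost:8000",
                       timeout: float = 5.0) -> List[str]:
    """Return the model ids an OpenAI-compatible server reports, or [] if unreachable."""
    try:
        # /v1/models is the standard model-listing route of the OpenAI-compatible API.
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return []
    return [entry["id"] for entry in payload.get("data", [])]
```

An empty list suggests the server is not up yet or is listening on a different address. Once it is running, the returned ids should include the reward model being served.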

2. Preprocess training data

We use the training prompts from UniGenBench, provided in "./data/unigenbench_train_data.txt".

bash fastvideo/data_preprocess/preprocess_flux_rl_embeddings.sh
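
A quick sanity check on the prompt file, run before the preprocessing script, can catch empty or repeated lines early. This sketch assumes the file holds one prompt per line, which this README does not state. `load_prompts` is a hypothetical helper and is not used by the preprocessing script.

```python
from pathlib import Path
from typing import List


def load_prompts(path: str) -> List[str]:
    """Return non-empty prompts with duplicates removed, keeping first-seen order."""
    seen = set()
    prompts = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        prompt = line.strip()
        if prompt and prompt not in seen:
            seen.add(prompt)
            prompts.append(prompt)
    return prompts
```

For example, `len(load_prompts("./data/unigenbench_train_data.txt"))` gives the number of distinct prompts that would be embedded.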

3. Train

bash finetune_prefgrpo_flux.sh
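
For orientation, the "pairwise preference reward" in the project title can be pictured as follows: the images sampled for one prompt are compared in pairs, each image's win rate becomes its reward, and rewards are normalized within the group as in GRPO. The code below is a minimal sketch of that idea and is not taken from this repository's training code. The `prefers` callable is a hypothetical stand-in for a query to the reward model, and treating every comparison as a strict win or loss is an assumption.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def win_rates(items: Sequence[T], prefers: Callable[[T, T], bool]) -> List[float]:
    """Fraction of pairwise comparisons each item wins within its group."""
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items to compare")
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        # prefers(a, b) is True when a is judged better than b (hypothetical judge).
        if prefers(items[i], items[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each item takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: numbers stand in for images, and "larger is better" stands in for the judge.
toy_rates = win_rates([0.1, 0.9, 0.5, 0.3], lambda a, b: a > b)
toy_advantages = group_advantages(toy_rates)
```

In the toy usage, the second item wins all three of its comparisons and so receives a win rate of 1.0 and the largest advantage. In real training the judge would be the served reward model rather than a numeric comparison.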

🚀 Inference and Evaluation

We use the test prompts from UniGenBench, provided in "./data/unigenbench_test_data.csv".

bash inference/flux_dist_infer.sh

Then, evaluate the outputs following UniGenBench.

📧 Contact

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.

🤗 Acknowledgments

Our training code is based on DanceGRPO, Flow-GRPO, and FastVideo.

We also use UniGenBench for T2I model semantic consistency evaluation.

Thanks to all the contributors!

⭐ Citation

@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}