Open source dataset
yuanzhiyong1999 opened this issue · 2 comments
It is mentioned in the paper that the SFT and DPO datasets built based on AUTOIF will be open-sourced together with Qwen2-72B. May I ask where they are?
May I ask whether the lines in seed_instruction.txt and augment_instructions.txt correspond to each other one to one? Is each line in augment an enhanced version of the corresponding line in seed?
@dongguanting
Thank you for your attention. As shown in the prompt example in our RFT.py, we have the supervision model directly generate 50 augmented instructions at a time from the seed data, so there is no one-to-one correspondence between the lines of the two files.
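To illustrate why the line numbers do not line up, here is a minimal sketch of batch-style augmentation. It is not the code from RFT.py: the prompt wording, the function names `build_augment_prompt` and `parse_augmented`, and the sample seed and response strings are all hypothetical. The only detail taken from the reply above is that the whole seed set goes into one prompt and the model is asked for 50 new instructions at once.

```python
def build_augment_prompt(seeds, k=50):
    """Build one prompt that shows all seeds and asks for k new instructions.

    The wording here is an assumption, not the actual prompt in RFT.py.
    """
    seed_block = "\n".join(seeds)
    return (
        "Here are some example instructions:\n"
        f"{seed_block}\n"
        f"Please write {k} new, different instructions, one per line."
    )


def parse_augmented(response):
    """Split a model response into non-empty, stripped lines."""
    return [line.strip() for line in response.splitlines() if line.strip()]


# Hypothetical seed instructions (stand-ins for seed_instruction.txt).
seeds = [
    "Answer in fewer than 20 words.",
    "Use no commas in your reply.",
]
prompt = build_augment_prompt(seeds, k=50)

# Hypothetical model reply, shortened to three lines for readability.
fake_response = (
    "End every sentence with a question mark.\n"
    "\n"
    "Write in all lowercase.\n"
    "Include exactly three bullet points.\n"
)
augmented = parse_augmented(fake_response)
```

Because each generated line is conditioned on the full seed list rather than on one seed, line i of the augmented output has no particular relation to line i of the seed file, and the two files need not even have the same number of lines.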