RLHFlow/Online-RLHF

Question about the iteration dataset (information leakage)?

hhhhzzzzz opened this issue · 8 comments

Hi,

In the RLHFlow/iterative-prompt-v1-iter1-20K dataset, I've noticed that 'context_messages' includes entries labeled 'assistant'. However, in the hendrydong/preference_700K dataset, the messages labeled 'user' form the prompt and the message labeled 'assistant' is the response.

This raises the question: does the RLHFlow/iterative-prompt-v1-iter1-20K dataset include responses to the prompts? If so, could this be considered information leakage within the dataset? In my opinion, the RLHFlow/iterative-prompt-v1-iter1-20K dataset should only include the prompts, not the responses.

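Here is a minimal sketch of the kind of check I mean, assuming the split is named "train" and that context_messages is a list of {"role", "content"} dicts:

```python
# Hypothetical check: count examples whose context_messages contain an
# 'assistant' turn (split name and field layout are assumptions).
from datasets import load_dataset

ds = load_dataset("RLHFlow/iterative-prompt-v1-iter1-20K", split="train")

n_with_assistant = sum(
    any(msg["role"] == "assistant" for msg in ex["context_messages"])
    for ex in ds
)
print(f"{n_with_assistant} / {len(ds)} examples contain an 'assistant' turn")
```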

Hi, I think this is because some of the prompts are multi-turn, meaning they consist of a conversation history instead of a single instruction. For instance, many of the prompts from HH-RLHF are multi-turn.

Handling multi-turn chat is more complicated. If you only want to work with single-turn chat, you can try the UltraFeedback prompts in weqweasdas/ultra_prompt_split.
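
To illustrate the distinction, here is a small hypothetical example (contents made up): in both cases the list ends with a user turn, and the earlier assistant turn in the multi-turn case is conversation history rather than a leaked response.

```python
# Hypothetical prompts for illustration only.
single_turn_prompt = [
    {"role": "user", "content": "Summarize the plot of Hamlet."},
]

multi_turn_prompt = [
    {"role": "user", "content": "Can you help me plan a trip to Japan?"},
    {"role": "assistant", "content": "Sure! How many days will you stay?"},
    {"role": "user", "content": "About ten days, mostly in Tokyo and Kyoto."},
]
# The assistant entry above is part of the prompt (history), not a response
# the policy model is supposed to produce.
```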

Hi @WeiXiongUST , can you suggest some repositories or instructions that show how to handle multi-turn chat?

The reward modeling presented in the RLHF Workflow can handle multi-turn chat because we include HH-RLHF in the training data. DPO and PPO also naturally handle multi-turn chat by using the multi-turn conversation history as the prompt.

To learn multi-turn behavior in the response part, we need new algorithms. I have a paper on this topic, but it is still under Google's internal review...
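
To make "use the conversation history as the prompt" concrete, here is a minimal sketch; the model name is only an example, and any tokenizer with a chat template works the same way:

```python
# Sketch: format a multi-turn history as a generation prompt.
from transformers import AutoTokenizer

# Example model; substitute whatever chat model you are actually training.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

context_messages = [
    {"role": "user", "content": "How do I bake sourdough bread?"},
    {"role": "assistant", "content": "Start by preparing a starter..."},
    {"role": "user", "content": "How long should I let it ferment?"},
]

# The whole history (including earlier assistant turns) becomes the prompt;
# add_generation_prompt=True appends the header for the next assistant turn,
# which is the part the policy model is asked to produce.
prompt = tokenizer.apply_chat_template(
    context_messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```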

So in the hendrydong/preference_700K dataset, can chosen[:-1] (or rejected[:-1]) be regarded as the prompt, and chosen[-1] as the response?

Yes, you are correct!
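
For concreteness, a minimal sketch of that split, assuming the chosen/rejected columns are lists of {"role", "content"} dicts and the split is named "train":

```python
from datasets import load_dataset

ds = load_dataset("hendrydong/preference_700K", split="train")

example = ds[0]
chosen, rejected = example["chosen"], example["rejected"]

prompt = chosen[:-1]              # shared conversation history
chosen_response = chosen[-1]      # preferred final assistant turn
rejected_response = rejected[-1]  # dispreferred final assistant turn

# Sanity check of the assumption: both columns share the same prompt part.
assert chosen[:-1] == rejected[:-1]
print(prompt)
print("chosen:", chosen_response["content"])
print("rejected:", rejected_response["content"])
```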

Thanks a lot for your reply!

Hi,

Could you tell me how to select prompts to form the iterative dataset?