RLHFlow/Online-RLHF

Question about the iteration dataset (information leakage)?

hhhhzzzzz opened this issue · 8 comments

Hi,

In the RLHFlow/iterative-prompt-v1-iter1-20K dataset, I've noticed that 'context_messages' includes entries labeled 'assistant'. However, in the hendrydong/preference_700K dataset, the messages labeled 'user' form the prompt and the message labeled 'assistant' is the response.

This raises the question: does the RLHFlow/iterative-prompt-v1-iter1-20K dataset include responses to the prompts? If so, could this be considered information leakage within the dataset? In my opinion, the RLHFlow/iterative-prompt-v1-iter1-20K dataset should only include the prompts, not the responses.

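Here is a minimal sketch of the kind of check I mean, assuming the split is named "train" and that context_messages is a list of {"role", "content"} dicts:

```python
# Hypothetical check: count examples whose context_messages contain an
# 'assistant' turn (split name and field layout are assumptions).
from datasets import load_dataset

ds = load_dataset("RLHFlow/iterative-prompt-v1-iter1-20K", split="train")

n_with_assistant = sum(
    any(msg["role"] == "assistant" for msg in ex["context_messages"])
    for ex in ds
)
print(f"{n_with_assistant} / {len(ds)} examples contain an 'assistant' turn")
```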

Hi, I think this is because some of the prompts are multi-turn, meaning they consist of a conversation history instead of a single instruction. For instance, many of the prompts from HH-RLHF are multi-turn.

Handling multi-turn chat is more complicated. If you only want to work with single-turn chat, you can try the UltraFeedback prompts in weqweasdas/ultra_prompt_split.
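
To illustrate the distinction, here is a small hypothetical example (contents made up): in both cases the list ends with a user turn, and the earlier assistant turn in the multi-turn case is conversation history rather than a leaked response.

```python
# Hypothetical prompts for illustration only.
single_turn_prompt = [
    {"role": "user", "content": "Summarize the plot of Hamlet."},
]

multi_turn_prompt = [
    {"role": "user", "content": "Can you help me plan a trip to Japan?"},
    {"role": "assistant", "content": "Sure! How many days will you stay?"},
    {"role": "user", "content": "About ten days, mostly in Tokyo and Kyoto."},
]
# The assistant entry above is part of the prompt (history), not a response
# the policy model is supposed to produce.
```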

Hi @WeiXiongUST , can you suggest some repositories or instructions that show how to handle multi-turn chat?

The reward modeling presented in the RLHF Workflow can handle multi-turn chat because we include HH-RLHF in the training data. DPO and PPO also naturally handle multi-turn chat by using the multi-turn conversation history as the prompt.

To learn multi-turn behavior in the response part, we need new algorithms. I have a paper on this topic, but it is still under Google's internal review...
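
To make "use the conversation history as the prompt" concrete, here is a minimal sketch; the model name is only an example, and any tokenizer with a chat template works the same way:

```python
# Sketch: format a multi-turn history as a generation prompt.
from transformers import AutoTokenizer

# Example model; substitute whatever chat model you are actually training.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

context_messages = [
    {"role": "user", "content": "How do I bake sourdough bread?"},
    {"role": "assistant", "content": "Start by preparing a starter..."},
    {"role": "user", "content": "How long should I let it ferment?"},
]

# The whole history (including earlier assistant turns) becomes the prompt;
# add_generation_prompt=True appends the header for the next assistant turn,
# which is the part the policy model is asked to produce.
prompt = tokenizer.apply_chat_template(
    context_messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```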

So in the hendrydong/preference_700K dataset, can chosen[:-1] (or rejected[:-1]) be regarded as the prompt, and chosen[-1] as the response?

Yes, you are correct!
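
For concreteness, a minimal sketch of that split, assuming the chosen/rejected columns are lists of {"role", "content"} dicts and the split is named "train":

```python
from datasets import load_dataset

ds = load_dataset("hendrydong/preference_700K", split="train")

example = ds[0]
chosen, rejected = example["chosen"], example["rejected"]

prompt = chosen[:-1]              # shared conversation history
chosen_response = chosen[-1]      # preferred final assistant turn
rejected_response = rejected[-1]  # dispreferred final assistant turn

# Sanity check of the assumption: both columns share the same prompt part.
assert chosen[:-1] == rejected[:-1]
print(prompt)
print("chosen:", chosen_response["content"])
print("rejected:", rejected_response["content"])
```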

Thanks a lot for your reply!

Hi,

Could you tell me how to select prompts to form the iterative dataset?