A question about training data

Question

gao8319 opened this issue a year ago · 1 comments

I hope this message finds you well.

I am working with a training dataset that includes fields named start_token_idx and end_token_idx, and I am trying to understand how these indices are calculated.

Could you please provide some insights or share the method used to calculate these indices in your dataset? Any details about the algorithms or the approach you followed would be extremely helpful.

Thanks for your reply.

Answer 1 · 2023-12-08T03:10:52.000Z

Hi, thanks for your interest in our work. The workflow for preprocessing the dataset is as follows:

1. The raw data interleaves text with tool calls, e.g., The answer is 1 + 1 = <add>(1, 1)=2</add>2.
   - As mentioned in the paper, we can acquire this data either from human-annotated datasets (like GSM8k) or by prompting LLMs with self-instruct.
2. Tokenize the sequence from the start until a tool call is reached; in the example, we first tokenize The answer is 1 + 1 =. Record the current token sequence length as the start_token_idx of this call.
3. Remove the tool call syntax and continue tokenizing until the return value of this call is fully tokenized. For this example, it's The answer is 1 + 1 = 2. Record the current token sequence length as the end_token_idx of this call.
4. Repeat steps 2 and 3 until the full sequence is tokenized (a training sample can contain more than one tool call).
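
For illustration, the steps above can be sketched roughly as follows. This is not the actual preprocessing script, just a minimal version under a few assumptions: the markup looks like <tool>(args)=result</tool> followed by the result repeated as plain text (inferred from the example), and toy_tokenize is a made-up stand-in for the real model tokenizer. The names CALL_RE, toy_tokenize and preprocess are hypothetical.

```python
import re

# Assumed markup, inferred from the example: <tool>(args)=result</tool>,
# immediately followed by the result repeated as plain text.
CALL_RE = re.compile(r"<(\w+)>\((.*?)\)=(.*?)</\1>")


def toy_tokenize(text):
    # Stand-in tokenizer (assumption): words and single punctuation marks.
    # Replace it with the tokenizer of the model you train.
    return re.findall(r"\w+|[^\w\s]", text)


def preprocess(raw, tokenize=toy_tokenize):
    plain = ""   # text accumulated so far, without tool call syntax
    calls = []   # one record per tool call
    pos = 0      # read position in the raw string
    for m in CALL_RE.finditer(raw):
        tool, args, result = m.groups()
        # Tokenize everything before the call: its length is start_token_idx.
        plain += raw[pos:m.start()]
        start = len(tokenize(plain))
        # Drop the markup, keep the plain-text return value that follows it.
        if not raw.startswith(result, m.end()):
            raise ValueError(f"return value {result!r} does not follow the call")
        plain += result
        # Length after the return value is tokenized: end_token_idx.
        end = len(tokenize(plain))
        calls.append({"tool": tool, "args": args, "result": result,
                      "start_token_idx": start, "end_token_idx": end})
        pos = m.end() + len(result)
    plain += raw[pos:]
    return plain, calls


plain, calls = preprocess("The answer is 1 + 1 = <add>(1, 1)=2</add>2.")
print(plain)
print(calls)
```

With a real subword tokenizer, tokenizing a prefix does not always give the same tokens as the prefix of the fully tokenized text, so the recorded lengths should be checked against the final token sequence.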

In this way, we end up with a sequence without tool calling syntax, i.e. "The answer is 1 + 1 = 2.", and we know that the tokens in [start_token_idx, end_token_idx) should be produced by a tool call. So in the training code, the target token at start_token_idx is labeled with a "toolken", while the remaining tokens in that range are masked, since they shouldn't be learned through next-token prediction. See here
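
As a rough sketch of the labeling described above (not the actual training code): build_labels, toolken_ids and the id 32000 are hypothetical, and IGNORE_INDEX = -100 is an assumption borrowed from the default ignore_index of PyTorch's cross-entropy loss. The one-position shift between inputs and targets for next-token prediction is assumed to be handled elsewhere.

```python
IGNORE_INDEX = -100  # assumption: positions with this label are skipped by the loss


def build_labels(token_ids, calls, toolken_ids):
    """Return per-position target labels for one training sample.

    token_ids   -- token ids of the text without tool call syntax
    calls       -- dicts with "tool", "start_token_idx", "end_token_idx"
    toolken_ids -- hypothetical mapping from tool name to its toolken id
    """
    labels = list(token_ids)  # ordinary tokens keep their own id as target
    for call in calls:
        start, end = call["start_token_idx"], call["end_token_idx"]
        # The first position of the span is trained to predict the toolken.
        labels[start] = toolken_ids[call["tool"]]
        # The rest of the span comes from the tool, so it is masked out.
        for i in range(start + 1, end):
            labels[i] = IGNORE_INDEX
    return labels


labels = build_labels(
    token_ids=[11, 12, 13, 14, 15, 16],
    calls=[{"tool": "add", "start_token_idx": 2, "end_token_idx": 5}],
    toolken_ids={"add": 32000},
)
print(labels)
```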

The preprocessing can be tricky, so you might want to double-check that the tokens in [start_token_idx, end_token_idx) match the return values of the tool calls.
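
One possible way to run that check, as a sketch only: check_spans is a hypothetical helper, tokens are assumed to be the string tokens of the final sequence, and detokenize defaults to plain concatenation, which you would replace with your tokenizer's decode function.

```python
def check_spans(tokens, calls, detokenize="".join):
    """Return the calls whose token span does not decode to the return value."""
    mismatches = []
    for call in calls:
        span = tokens[call["start_token_idx"]:call["end_token_idx"]]
        text = detokenize(span).strip()
        # Compare ignoring surrounding whitespace, which tokenizers often add.
        if text != call["result"].strip():
            mismatches.append((call, text))
    return mismatches


tokens = ["The", "answer", "is", "1", "+", "1", "=", "2", "."]
good = [{"result": "2", "start_token_idx": 7, "end_token_idx": 8}]
bad = [{"result": "2", "start_token_idx": 6, "end_token_idx": 7}]
print(check_spans(tokens, good))
print(check_spans(tokens, bad))
```

An empty list means every span matched; anything else points at samples whose indices are off.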

Feel free to follow up if you have any questions.