bigscience-workshop/t-zero

Truncation method

gportill opened this issue · 4 comments

Hello,

What method did you use for truncation in T0?

The experimental setup section of the T0 paper states, "We feed the model input and target sequences of 1024 and 256 tokens, respectively." However, I have not been able to find whether truncation is done at the beginning or at the end of texts, or whether another method is used.

I'm specifically wondering about this with respect to the wikihop original dataset, where many of the inputs are >1024 tokens.

Thank you!

Hi @gportill , the truncation is done in the mesh tf codebase: https://github.com/tensorflow/mesh/blob/d229a4471d8d1cea0d0a42d9c8e471950c4027c0/mesh_tensorflow/transformer/dataset.py#L479

Sequences in the incoming examples are truncated to length "length", and the sequences in the output examples all have fixed (padded) length "length".
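
As a rough illustration of the behaviour that docstring describes, here is a minimal Python sketch. It is not the Mesh TensorFlow implementation: the function name `truncate_and_pad` and the pad id of 0 are assumptions made for the example, and it assumes the tail of an over-long sequence is what gets dropped.

```python
from typing import List


def truncate_and_pad(token_ids: List[int], length: int, pad_id: int = 0) -> List[int]:
    """Return exactly `length` ids: keep the first `length`, then right-pad."""
    # Slicing keeps the beginning of the sequence, so the end is what is cut.
    kept = token_ids[:length]
    # Short sequences are padded on the right up to the fixed length.
    return kept + [pad_id] * (length - len(kept))


long_example = truncate_and_pad([11, 12, 13, 14, 15], length=3)
short_example = truncate_and_pad([11, 12], length=4)
```

In this sketch every output has the same length, which matches the "fixed (padded) length" wording above.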

Got it, looks like the end of the text is truncated.

I ask because in many prompts the main "question" occurs at the end of the prompt. For multiple-choice dataset prompts, the answer choices are also often listed at the end. This is the case for 5 of the 9 prompts for wiki_hop original, in which many inputs are >1024 tokens. All five of those prompts correspond to the original task intended by the dataset authors.

It seems like important information might be cut off with end truncation. However, since you have confirmed that end truncation was performed in the T0 experiments, I will also truncate at the end. We are trying to remain true to the data and methods used in T0.
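
To make the concern concrete, here is a small sketch comparing the two truncation directions on a toy prompt whose question sits at the end. The token list, the helper names `truncate_end` and `truncate_start`, and the limit of 8 tokens are all hypothetical; real T0 inputs would be tokenized with the model's tokenizer and limited to 1024 tokens.

```python
from typing import List


def truncate_end(tokens: List[str], max_len: int) -> List[str]:
    """Keep the first `max_len` tokens (the end of the text is dropped)."""
    return tokens[:max_len]


def truncate_start(tokens: List[str], max_len: int) -> List[str]:
    """Keep the last `max_len` tokens (the beginning of the text is dropped)."""
    # Guard against max_len == 0, because tokens[-0:] would return everything.
    return tokens[-max_len:] if max_len > 0 else []


# Toy prompt: a long context followed by the question at the very end.
context = ["ctx"] * 10
question = ["Which", "entity", "is", "the", "answer", "?"]
prompt = context + question

kept_end = truncate_end(prompt, 8)
kept_start = truncate_start(prompt, 8)

print("?" in kept_end)    # False: the question was cut off
print("?" in kept_start)  # True: the question survives
```

With end truncation the toy question disappears entirely, which is the situation described above for long wiki_hop inputs.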

Thanks again!

Yes that's right.
Fwiw, we did remove datasets where the majority of inputs were longer than 1024 tokens, so although wikihop has some very long inputs, these examples should be pretty rare at the scale of the T0* mixtures!

Good to know.

I am working with the validation splits of the multiple choice datasets used to train T0, and I counted the number of inputs (once prompts have been applied) that had >= 1024 tokens. Wikihop, which has 5,129 examples in total, was the only dataset that had any inputs >= 1024 tokens.

Number of inputs with length >= 1024 tokens in wikihop original:

  • wiki_hop_original_choose_best_object_interrogative_1: 3867
  • wiki_hop_original_choose_best_object_interrogative_2: 3867
  • wiki_hop_original_choose_best_object_affirmative_3: 3906
  • wiki_hop_original_choose_best_object_affirmative_2: 3885
  • wiki_hop_original_choose_best_object_affirmative_1: 3885

(I'm working with only a subset of the prompts, so I only provide stats for the prompts above.)
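
For anyone who wants to reproduce this kind of count, a minimal sketch follows. It is not the script I used: `count_long_inputs` is a hypothetical helper, and the whitespace tokenizer (`str.split`) is only a stand-in, so a faithful count would pass the model's actual tokenizer instead.

```python
from typing import Callable, Dict, List


def count_long_inputs(
    inputs_by_prompt: Dict[str, List[str]],
    tokenize: Callable[[str], List[str]],
    threshold: int = 1024,
) -> Dict[str, int]:
    """For each prompt name, count inputs with at least `threshold` tokens."""
    return {
        name: sum(1 for text in texts if len(tokenize(text)) >= threshold)
        for name, texts in inputs_by_prompt.items()
    }


# Toy data with a small threshold so the behaviour is easy to check by hand.
toy_inputs = {
    "prompt_a": ["a b c", "a b c d e"],
    "prompt_b": ["a"],
}
counts = count_long_inputs(toy_inputs, str.split, threshold=4)
print(counts)  # {'prompt_a': 1, 'prompt_b': 0}

# Share of the 5,129 validation examples for the first count reported above.
share = 3867 / 5129  # roughly 0.754
```

Taken at face value, the counts listed above correspond to roughly three quarters of the wikihop validation examples for each of these prompts.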

I downloaded the splits as you do in t-zero/blob/master/evaluation/run_eval.py, so it's strange that so many of the examples in the validation split are so long.

Just wanted to leave a note for record-keeping. We'll truncate at the end in order to remain true to the T0 methods. Thanks again!