google-research/text-to-text-transfer-transformer

Clarification on T5 Model Pre-training Objective and Denoising Process

AliHaiderAhmad001 opened this issue · 0 comments

I am currently developing a T5 model (encoder-decoder architecture) from scratch for educational purposes. While working on this project, I've encountered some confusion regarding the pre-training objective, specifically the denoising objective. I would like to clarify my understanding and have some questions about the process.

Given the sentence:

Thank you for inviting me to your party last week.

Based on my understanding, during the pre-training phase with a denoising objective, the model works as follows (see the code sketch after this list):

  • Encoder input: Thank you <X> me to your party <Y> week
  • Decoder input: <X> for inviting <Y> last
  • Decoder labels (true labels): for inviting <Y> last <Z>
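To make my interpretation concrete, here is a toy sketch of the construction in the bullets above. It uses plain string "tokens" and whitespace splitting purely for readability; a real implementation would operate on SentencePiece token IDs.

```python
# Toy sketch of the construction described in the bullets above.

# Full target sequence: dropped-out spans delimited by their sentinels,
# plus a final sentinel.
target = "<X> for inviting <Y> last <Z>".split()

encoder_input = "Thank you <X> me to your party <Y> week".split()

# My assumption: the decoder input is the label sequence shifted right by one
# position (teacher forcing). Whether a dedicated start/pad token should be
# prepended instead, so that <X> itself is also predicted, is part of what I
# want to confirm.
decoder_input = target[:-1]    # <X> for inviting <Y> last
decoder_labels = target[1:]    # for inviting <Y> last <Z>

print(encoder_input)
print(decoder_input)
print(decoder_labels)
```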

Here are my questions:

  1. Is my interpretation of how the encoder input, decoder input, and decoder labels are constructed correct?
  2. In this setup, the model is expected to predict sentinel tokens (e.g., <X>, <Y>). Could this potentially confuse the model? For example, could it pick up the idea that the word "last" can come after the token <Y> in real text? Or does the model naturally learn to interpret these situations correctly?

According to the paper:


we process the sentence "Thank you for inviting me to your party last week." The words "for", "inviting", and "last" are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as <X> and <Y>) that is unique over the example. Since "for" and "inviting" occur consecutively, they are replaced by a single sentinel <X>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token <Z>.
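For completeness, here is a minimal, self-contained sketch of that corruption procedure as I understand it. It uses whitespace "tokens" instead of SentencePiece IDs, and the corrupted positions are hard-coded rather than randomly sampled, purely to reproduce the example from the paper.

```python
# Minimal sketch of span corruption as described in the paper's example.
# Each consecutive run of corrupted tokens is replaced in the input by one
# unique sentinel; the target collects the dropped-out spans, delimited by
# the same sentinels, plus a final sentinel.

def span_corrupt(tokens, corrupted_positions, sentinels=("<X>", "<Y>", "<Z>")):
    """Return (input_tokens, target_tokens) for a fixed set of corrupted positions."""
    corrupted = set(corrupted_positions)
    inputs, targets = [], []
    sentinel_idx = 0
    i = 0
    while i < len(tokens):
        if i in corrupted:
            sentinel = sentinels[sentinel_idx]
            sentinel_idx += 1
            inputs.append(sentinel)
            targets.append(sentinel)
            # Consume the whole consecutive span of corrupted tokens.
            while i < len(tokens) and i in corrupted:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(sentinels[sentinel_idx])  # final sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, corrupted_positions={2, 3, 8})
print(" ".join(inputs))   # Thank you <X> me to your party <Y> week .
print(" ".join(targets))  # <X> for inviting <Y> last <Z>
```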