How do I pre-train the T5 model in the HuggingFace library using my own text corpus?
abhisheknovoic opened this issue · 17 comments
Hello,
I understand how the T5 architecture works, and I have my own large corpus where I want to mask spans of tokens and replace them with sentinel tokens.
I also understand the tokenizers in HuggingFace, especially the T5 tokenizer.
Can someone point me to a document, or refer me to the class I need to use, to pretrain a T5 model on my corpus using the masked language modeling approach?
Thanks
Hi @abhisheknovoic, this might help you: https://huggingface.co/transformers/model_doc/t5.html#training
Check the "Unsupervised denoising training" section.
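For a concrete picture, here is a minimal sketch of the span-corruption objective described there, assuming a recent transformers release (where the argument is labels rather than lm_labels) and the t5-small checkpoint:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Corrupted input: each masked span is replaced by a sentinel token.
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
# Target: each sentinel token is followed by the span it replaced.
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids

# The forward pass builds decoder_input_ids from labels and returns the LM loss.
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()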
@patil-suraj , do you mean this class? - T5ForConditionalGeneration
Also, at the top of the page, there is the following code:
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
Any idea which class the model is instantiated from? I could not find any class with an lm_labels parameter.
Thanks
Yes, it's T5ForConditionalGeneration, and lm_labels has now been changed to labels.
Pinging @patrickvonplaten for more details.
@patil-suraj, I tried the following code, which throws an error. Any idea why? Thanks
In [32]: from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
In [33]: input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
In [34]: labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
In [35]: config = T5Config()
In [36]: model = T5ForConditionalGeneration(config=config)
In [37]: model(input_ids=input_ids, lm_labels=labels)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-37-6717b0ecfbf5> in <module>
----> 1 model(input_ids=input_ids, lm_labels=labels)
/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in forward(self, input_ids, attention_mask, encoder_outputs, decoder_input_ids, decoder_attention_mask, decoder_past_key_value_states, use_cache, lm_labels, inputs_embeds, decoder_inputs_embeds, head_mask)
1068 if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
1069 # get decoder inputs from shifting lm labels to the right
-> 1070 decoder_input_ids = self._shift_right(lm_labels)
1071
1072 # If decoding with past key value states, only the last tokens
/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in _shift_right(self, input_ids)
609 assert (
610 decoder_start_token_id is not None
--> 611 ), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"
612
613 # shift inputs to the right
AssertionError: self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information
My versions are
transformers==2.11.0
tokenizers==0.7.0
If you are using 2.11.0 then use lm_labels, and if you are using master then use labels.
@patil-suraj, thanks. I installed the master version, but it still fails with the same error. It seems like I need to specify something for decoder_start_token_id.
Ok, I got it working. I initialized the config as follows:
config = T5Config(decoder_start_token_id=tokenizer.convert_tokens_to_ids(['<pad>'])[0])
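A slightly more direct way to get the same id, assuming the same tokenizer object from the session above, is to read it off the tokenizer, since T5 uses the pad token as the decoder start token:
# equivalent: pad_token_id is the id of '<pad>'
config = T5Config(decoder_start_token_id=tokenizer.pad_token_id)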
@patil-suraj , however, if we use the master branch, it seems like the tokenizers are broken. The T5 tokenizer doesn't tokenize the sentinel tokens correctly.
Feel free to also open a PR to correct lm_labels to labels in the comment :-)
Just saw that @patil-suraj already did this - awesome thanks :-)
@abhisheknovoic regarding the T5 tokenizer, can you post some code here that shows that T5 tokenization is broken? (It would be great if we can easily reproduce the error.)
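For anyone wanting to check this, a quick probe along these lines should show whether the sentinels survive tokenization, assuming the standard t5-small vocabulary, where the 100 sentinels occupy the final ids (e.g. <extra_id_0> maps to 32099):
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Each sentinel should stay a single token with its own id.
print(tokenizer.tokenize('The <extra_id_0> walks in <extra_id_1> park'))
print(tokenizer.convert_tokens_to_ids('<extra_id_0>'))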
@patrickvonplaten it would be nice if we also added seq2seq (T5, BART) model pre-training examples to the official examples
cc @sshleifer
Definitely!
Not sure if this should be a separate issue or not, but I am having difficulty training my own T5 tokenizer. When I train a BPE tokenizer using the amazing huggingface tokenizers library and attempt to load it via
tokenizer = T5Tokenizer.from_pretrained('./tokenizer')
I get the following error:
OSError: Model name './tokenizer/' was not found in tokenizers model name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). We assumed './tokenizer/' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
When I instead attempted to train a sentencepiece model using the, again amazing, huggingface tokenizers library, I got the same error, because the tokenizer.save method does not actually generate the spiece.model file.
Am I doing something wrong?
Transformers version: 2.11.0
Tokenizers version: 0.7.0
Here is a colab to reproduce the error: https://colab.research.google.com/drive/1WX1Q2Ze9k0SxFMLLv1aFgVGBFMEVTyDe?usp=sharing
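For what it's worth, one way to produce the spiece.model file that T5Tokenizer expects is to train with the sentencepiece library directly instead of tokenizers; a minimal sketch, where corpus.txt is a hypothetical plain-text file with one sentence per line:
import sentencepiece as spm

# Trains a unigram model and writes spiece.model / spiece.vocab,
# which is exactly the file T5Tokenizer.from_pretrained() looks for.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spiece',
    vocab_size=32000,
    model_type='unigram',
)
Note that T5Tokenizer adds the 100 <extra_id_*> sentinels on top of this vocabulary itself (via its extra_ids argument), so they do not need to be part of the SentencePiece model.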
@mfuntowicz @n1t0 - maybe you can help here
Definitely!
The pre-training scripts would really help; the original Mesh Transformer codebase is very complicated to understand.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (rather than Flax).
You can take a look!
Any suggestions are more than welcome.