How do I pre-train the T5 model in the HuggingFace library using my own text corpus?
abhisheknovoic opened this issue · 17 comments
Hello,
I understand how the T5 architecture works, and I have my own large corpus where I want to mask spans of tokens and replace them with sentinel tokens.
I also understand the tokenizers in HuggingFace, especially the T5 tokenizer.
Can someone point me to a document, or refer me to the class I need to use, to pretrain a T5 model on my corpus using the masked language modeling approach?
Thanks
Hi @abhisheknovoic, this might help you: https://huggingface.co/transformers/model_doc/t5.html#training
Check the "Unsupervised denoising training" section.
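For a concrete picture, here is a minimal sketch of the span-corruption objective described there, assuming a recent transformers release (where the argument is labels rather than lm_labels) and the t5-small checkpoint:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Corrupted input: each masked span is replaced by a sentinel token.
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
# Target: each sentinel token is followed by the span it replaced.
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids

# The forward pass builds decoder_input_ids from labels and returns the LM loss.
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()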
@patil-suraj , do you mean this class? - T5ForConditionalGeneration
Also, at the top of the page, there is the following code:
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
Any idea which class the model is instantiated from? I could not find any class with an lm_labels parameter.
Thanks
Yes, it's T5ForConditionalGeneration, and lm_labels has now been changed to labels.
Pinging @patrickvonplaten for more details.
@patil-suraj, I tried the following code, which throws an error. Any idea why? Thanks
In [32]: from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
In [33]: input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
In [34]: labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
In [35]: config = T5Config()
In [36]: model = T5ForConditionalGeneration(config=config)
In [37]: model(input_ids=input_ids, lm_labels=labels)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-37-6717b0ecfbf5> in <module>
----> 1 model(input_ids=input_ids, lm_labels=labels)
/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in forward(self, input_ids, attention_mask, encoder_outputs, decoder_input_ids, decoder_attention_mask, decoder_past_key_value_states, use_cache, lm_labels, inputs_embeds, decoder_inputs_embeds, head_mask)
1068 if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
1069 # get decoder inputs from shifting lm labels to the right
-> 1070 decoder_input_ids = self._shift_right(lm_labels)
1071
1072 # If decoding with past key value states, only the last tokens
/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in _shift_right(self, input_ids)
609 assert (
610 decoder_start_token_id is not None
--> 611 ), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"
612
613 # shift inputs to the right
AssertionError: self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information
My versions are
transformers==2.11.0
tokenizers==0.7.0
If you are using 2.11.0 then use lm_labels, and if you are using master then use labels.
@patil-suraj, thanks. I installed the master version, but it still fails with the same error. It seems like I need to specify something for decoder_start_token_id.
Ok, I got it working. I initialized the config as follows:
config = T5Config(decoder_start_token_id=tokenizer.convert_tokens_to_ids(['<pad>'])[0])
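A slightly more direct way to get the same id, assuming the same tokenizer object from the session above, is to read it off the tokenizer, since T5 uses the pad token as the decoder start token:
# equivalent: pad_token_id is the id of '<pad>'
config = T5Config(decoder_start_token_id=tokenizer.pad_token_id)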
@patil-suraj , however, if we use the master branch, it seems like the tokenizers are broken. The T5 tokenizer doesn't tokenize the sentinel tokens correctly.
Feel free to also open a PR to correct lm_labels to labels in the comment :-)
Just saw that @patil-suraj already did this - awesome thanks :-)
@abhisheknovoic regarding the T5 tokenizer, can you post some code here that shows that T5 tokenization is broken? (It would be great if we can easily reproduce the error.)
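For anyone wanting to check this, a quick probe along these lines should show whether the sentinels survive tokenization, assuming the standard t5-small vocabulary, where the 100 sentinels occupy the final ids (e.g. <extra_id_0> maps to 32099):
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Each sentinel should stay a single token with its own id.
print(tokenizer.tokenize('The <extra_id_0> walks in <extra_id_1> park'))
print(tokenizer.convert_tokens_to_ids('<extra_id_0>'))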
@patrickvonplaten it would be nice if we also added seq2seq (T5, BART) model pre-training examples to the official examples
cc @sshleifer
Definitely!
Not sure if this should be a separate issue or not, but I am having difficulty training my own T5 tokenizer. When I train a BPE tokenizer using the amazing huggingface tokenizers library and attempt to load it via
tokenizer = T5Tokenizer.from_pretrained('./tokenizer')
I get the following error:
OSError: Model name './tokenizer/' was not found in tokenizers model name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). We assumed './tokenizer/' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
When I instead attempted to train a sentencepiece model using the, again amazing, huggingface tokenizers library, I got the same error, because the tokenizer.save method does not actually generate the spiece.model file.
Am I doing something wrong?
Transformers version: 2.11.0
Tokenizers version: 0.7.0
Here is a colab to reproduce the error: https://colab.research.google.com/drive/1WX1Q2Ze9k0SxFMLLv1aFgVGBFMEVTyDe?usp=sharing
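For what it's worth, one way to produce the spiece.model file that T5Tokenizer expects is to train with the sentencepiece library directly instead of tokenizers; a minimal sketch, where corpus.txt is a hypothetical plain-text file with one sentence per line:
import sentencepiece as spm

# Trains a unigram model and writes spiece.model / spiece.vocab,
# which is exactly the file T5Tokenizer.from_pretrained() looks for.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spiece',
    vocab_size=32000,
    model_type='unigram',
)
Note that T5Tokenizer adds the 100 <extra_id_*> sentinels on top of this vocabulary itself (via its extra_ids argument), so they do not need to be part of the SentencePiece model.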
@mfuntowicz @n1t0 - maybe you can help here
Definitely!
The pre-training scripts would really help; the original Mesh Transformer codebase is very complicated to understand.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (rather than Flax).
You can take a look!
Any suggestions are more than welcome.