Pre-training and validating the T5 transformer in Brazilian Portuguese data
We are preparing an arXiv submission and soon will provide a citation. For now, if you need to cite use:
@misc{ptt5_2020,
Author = {Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo},
Title = {PTT5: Pre-training and validating the T5 transformer in Brazilian Portuguese data},
Year = {2020},
Publisher = {GitHub},
Journal = {GitHub repository},
Howpublished = {\url{https://github.com/dl4nlp-rg/PTT5}}
}
Weight downloads:
Soon we will make our model available in HuggingFace.
Get the config files in: assin/T5_configs_json
Example loading with T5ForConditionalGeneration, ckpt_path is the path to the .pth weigh.:
from transformers import PretrainedConfig, T5ForConditionalGeneration
config = PretrainedConfig.from_json_file(config_path)
state_dict = torch.load(ckpt_path)
self.t5 = T5ForConditionalGeneration.from_pretrained(pretrained_model_name_or_path=None,
config=config,
state_dict=state_dict)
To load the custom vocabulary use the .model in: assin/custom_vocab/spm_32000_unigram Example loading vocabulary:
import sentencepiece as spm
from transformers import T5Tokenizer
def get_custom_vocab():
# Path to SentencePiece model
SP_MODEL_PATH = 'custom_vocab/spm_32000_unigram/spm_32000_pt.model'
# Loading on sentencepiece
sp = spm.SentencePieceProcessor()
sp.load(SP_MODEL_PATH)
# Loading o HuggingFace
return T5Tokenizer.from_pretrained(SP_MODEL_PATH)
Code related to ASSIN fine-tuning, validation and testing, including making plots and data. Original data source: https://sites.google.com/view/assin2/
Copy of the notebook which processed the BrWac original data on Google Colaboratory. The original data can be downloaded on https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC
Scripts and code related to using Google Cloud TPUs for pre-training and making plots.
Some utility code.
Code related to the creation of the custom Portuguese vocabulary.
This work was developed as the final project for the IA376E course taught by Professors Rodrigo Souza and Roberto Lotufo at the State University of Campinas (UNICAMP).