
Repository for training T5 on Portuguese.

Primary language: Shell. License: MIT.

PTT5

Pre-training and validating the T5 transformer in Brazilian Portuguese data

Citation

We are preparing an arXiv submission and will provide a citation soon. Until then, if you need to cite this work, use:

@misc{ptt5_2020,
  Author = {Diedre Carmo and Marcos Piau and Israel Campiotti and Rodrigo Nogueira and Roberto Lotufo},
  Title = {PTT5: Pre-training and validating the T5 transformer in Brazilian Portuguese data},
  Year = {2020},
  Publisher = {GitHub},
  Journal = {GitHub repository},
  Howpublished = {\url{https://github.com/dl4nlp-rg/PTT5}}
}

How to use PTT5:

Weight downloads:

| Size | Vocabulary | Epochs | Link |
|---|---|---|---|
| Base | T5 | 4 | https://www.dropbox.com/s/pu18znurr6vqbio/ptt5-4epoch-standard-vocab-base-1229941.pth?dl=0 |
| Base | custom PT | 4 | https://www.dropbox.com/s/y0a1ea02bivjt60/ptt5-custom-vocab-base-1229942.pth?dl=0 |
| Large | T5 | 4 | https://www.dropbox.com/s/7btqekm7mfysdeb/ptt5-standard-vocab-large-1461673.pth?dl=0 |
| Large | custom PT | 4 | https://www.dropbox.com/s/20zxpgz7guurn33/ptt5-custom-vocab-large-1460784.pth?dl=0 |
| Large | custom PT | 2 | https://www.dropbox.com/s/jchdt8s5iazko8l/ptt5-2poch-custom-vocab-large-1230742.pth?dl=0 |
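
The links above are Dropbox share links ending in `?dl=0`, which open a preview page in a browser. For scripted downloads, a common approach is to request the same link with `dl=1`. The helper below is a minimal sketch of that rewrite; the function name `direct_download_url` is hypothetical and not part of this repository.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def direct_download_url(share_url: str) -> str:
    """Return the share link with its `dl` query parameter set to 1."""
    parts = urlsplit(share_url)
    # Keep any other query parameters and only override `dl`.
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


url = direct_download_url(
    "https://www.dropbox.com/s/pu18znurr6vqbio/"
    "ptt5-4epoch-standard-vocab-base-1229941.pth?dl=0"
)
print(url)
```

The resulting URL can then be handed to a downloader such as `urllib.request.urlretrieve(url, "ptt5-base.pth")`, where the local file name is only an example. The checkpoints are large, so expect the download to take a while.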

We will soon make our models available on HuggingFace.

Loading weights

The config files are in: assin/T5_configs_json

Example of loading with T5ForConditionalGeneration, where config_path is the path to one of the config files and ckpt_path is the path to the downloaded .pth weights:

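import torch
from transformers import PretrainedConfig, T5ForConditionalGeneration

# config_path: a JSON file from assin/T5_configs_json
# ckpt_path: one of the downloaded .pth files
config = PretrainedConfig.from_json_file(config_path)
state_dict = torch.load(ckpt_path)

# With pretrained_model_name_or_path=None, the weights are taken from state_dict
model = T5ForConditionalGeneration.from_pretrained(pretrained_model_name_or_path=None,
                                                   config=config,
                                                   state_dict=state_dict)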
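
When weights are passed in through `state_dict`, it can be worth confirming that the checkpoint's parameter names line up with the ones the model expects, since a config and checkpoint of different sizes or vocabularies will not match. The following is a minimal sketch of such a check in plain Python; `diff_state_dict_keys` is a hypothetical helper, not part of this repository or of transformers, and the key names in the toy example are for illustration only.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    expected: Iterable[str], found: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) parameter names, each sorted."""
    expected_set, found_set = set(expected), set(found)
    missing = sorted(expected_set - found_set)      # model wants these, checkpoint lacks them
    unexpected = sorted(found_set - expected_set)   # checkpoint has these, model does not
    return missing, unexpected


# Toy key names, for illustration only.
expected_keys = ["shared.weight", "lm_head.weight"]
checkpoint_keys = ["shared.weight", "extra.bias"]

missing, unexpected = diff_state_dict_keys(expected_keys, checkpoint_keys)
print(missing)      # ['lm_head.weight']
print(unexpected)   # ['extra.bias']
```

With real objects, the two arguments would be the keys of the loaded model's `state_dict()` and the keys of the dictionary returned by `torch.load`. Two empty lists suggest that the config and the checkpoint belong together.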

Load PT custom vocab

To load the custom vocabulary, use the .model file in: assin/custom_vocab/spm_32000_unigram

Example of loading the vocabulary:

import sentencepiece as spm
from transformers import T5Tokenizer

def get_custom_vocab():
    # Path to SentencePiece model
    SP_MODEL_PATH = 'custom_vocab/spm_32000_unigram/spm_32000_pt.model'

    # Loading on sentencepiece
    sp = spm.SentencePieceProcessor()
    sp.load(SP_MODEL_PATH)

    # Loading on HuggingFace
    return T5Tokenizer.from_pretrained(SP_MODEL_PATH)

Folders

assin

Code related to ASSIN fine-tuning, validation and testing, including making plots and data. Original data source: https://sites.google.com/view/assin2/

brwac

Copy of the notebook that processed the original BrWaC data on Google Colaboratory. The original data can be downloaded from https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC

pretraining

Scripts and code related to using Google Cloud TPUs for pre-training and making plots.

utils

Some utility code.

vocab

Code related to the creation of the custom Portuguese vocabulary.

Acknowledgement

This work was developed as the final project for the IA376E course taught by Professors Rodrigo Souza and Roberto Lotufo at the State University of Campinas (UNICAMP).