
IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation 🇮🇹

Gabriele Sarti, Malvina Nissim

Abstract: The T5 model and its unified text-to-text paradigm contributed to advancing the state-of-the-art for many natural language processing tasks. While some multilingual variants of the T5 model have recently been introduced, their performance was found to be suboptimal for languages other than English when compared to ad-hoc monolingual variants. Motivated by these findings, we introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on the Italian language. We perform a thorough cleaning of a web-crawled Italian corpus including more than 40 billion words and use it to pretrain three IT5 models of different sizes. We then evaluate the performance of the IT5 models and their multilingual counterparts on a broad range of natural language understanding and generation benchmarks for Italian. We find the monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for most Italian conditional language generation tasks.

This repository groups links and materials for the paper "IT5: Text-to-text Pretraining for Italian Language Understanding and Generation". If you use any of the following contents for your work, we kindly ask you to cite our paper:

@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele  and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
}

News

April 2022: New efficient checkpoints for the IT5 Small model, using the 32EL architecture from the Scale Efficiently paper by Google, now also with a cased vocabulary! Available in the demo. Thanks to Stefan Schweter for his contribution!

Web Demo

Integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo on Hugging Face Spaces.

Pre-training Materials

  • The repository gsarti/t5-flax-gcp provides the script and a detailed explanation of the pre-training process using Huggingface and Flax on a TPU v3-8 VM via Google Cloud Platform.

  • The Cleaned Italian mC4 Corpus used for pre-training the IT5 models is made available on the Huggingface Datasets Hub under the identifier gsarti/clean_mc4_it.

  • The following pre-trained IT5 models are made available via the Huggingface Models Hub (a loading sketch for the corpus and the checkpoints is provided after this list):

    • IT5 Small, encoder-decoder with 6+6 layers and 60M parameters.

    • IT5 Base, encoder-decoder with 12+12 layers and 220M parameters.

    • IT5 Large, encoder-decoder with 24+24 layers and 738M parameters.

    • New! IT5 Efficient Small, encoder-decoder with 32+6 layers and 143M parameters, using a cased vocabulary.
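
As a rough orientation, the snippet below sketches how the cleaned corpus and a pretrained checkpoint can be loaded with the datasets and transformers libraries. The "tiny" corpus configuration and the gsarti/it5-base identifier are assumptions based on the names above; refer to the dataset and model cards for the exact options:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stream the cleaned Italian mC4 corpus instead of downloading it in full.
# The "tiny" configuration name is an assumption; check the dataset card
# for the configurations actually available.
corpus = load_dataset("gsarti/clean_mc4_it", "tiny", split="train", streaming=True)
print(next(iter(corpus))["text"][:200])

# Load one of the pretrained (not yet fine-tuned) IT5 checkpoints.
# "gsarti/it5-base" is assumed here; see the Models Hub for the exact identifiers.
tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-base")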

Experiments Materials

We are unable to freely release the fine-tuning data due to access restrictions imposed by some of the original dataset creators. Please reach out at gabriele.sarti996@gmail.com with proof of having received access to the XFORMAL dataset (procedure here), and we will be happy to provide you with the preprocessed data.

This repository contains the following materials to reproduce fine-tuning experiments and evaluation:

  • The folder finetuning contains the run_seq2seq.py script used to fine-tune the models on the different tasks, plus multiple helper files used to parametrize and run the experiments on a SLURM cluster.

  • The folder inference contains the infer.py script used to predict the outputs of all tested models on all datasets, plus multiple helper files used to parametrize and run inference on a SLURM cluster.

  • The folder model_predictions contains all the predictions produced with the inference script for all models and tested datasets, in plain-text, one-line-per-example format.

  • The notebook compute_scores.ipynb contains the code used to evaluate the performance of all models on all datasets. The baseline file bertscore_baseline_ita.tsv is used in the notebook to compute the baseline-rescaled BERTScore values, as sketched after this list.
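
For orientation only, here is a minimal sketch of how such baseline-rescaled BERTScore values can be computed with the bert_score library. The example sentences are made up, and the model and arguments actually used in compute_scores.ipynb may differ:

from bert_score import score

# Hypothetical predictions and references, e.g. read from the
# one-line-per-example files in model_predictions and the gold outputs.
predictions = ["Il governo ha approvato la nuova legge di bilancio."]
references = ["La nuova legge di bilancio è stata approvata dal governo."]

# Rescale raw BERTScore values with the Italian baseline file shipped in this
# repository; the default model selected for lang="it" may differ from the
# one used in the notebook.
precision, recall, f1 = score(
    predictions,
    references,
    lang="it",
    rescale_with_baseline=True,
    baseline_path="bertscore_baseline_ita.tsv",
)
print(f"BERTScore F1: {f1.mean().item():.3f}")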

We release all 54 fine-tuned model checkpoints (3 IT5 models + 1 Efficient IT5 model and 2 mT5 models, each fine-tuned on a total of 9 tasks) in the it5 collection on Huggingface. All models include Tensorboard logs for the fine-tuning procedure and can be used with the Huggingface Transformers library in TensorFlow, PyTorch and JAX. They can be used directly with pipelines as:

from transformers import pipeline

# e.g. to load IT5 Small trained on formal-to-informal style 
# transfer, use `gsarti/it5-small-formal-to-informal`
f2i = pipeline("text2text-generation", model='gsarti/it5-small-formal-to-informal')
f2i("Vi ringrazio infinitamente per vostra disponibilità")
>>> [{"generated_text": "e grazie per la vostra disponibilità!"}]

or loaded separately as:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# e.g. to load IT5 Small trained on headline generation,
# use `gsarti/it5-small-headline-generation` as MODEL ID.
tokenizer = AutoTokenizer.from_pretrained("<MODEL ID>")

model = AutoModelForSeq2SeqLM.from_pretrained("<MODEL ID>")
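
Once loaded, generation follows the standard Transformers API. A minimal usage sketch (the Italian input sentence is only an illustration, and the generation parameters are not the ones used in the paper):

# e.g. with `gsarti/it5-small-headline-generation` loaded as <MODEL ID> above
article = (
    "Le previsioni meteo annunciano piogge intense su gran parte della penisola "
    "per l'intero fine settimana."
)
inputs = tokenizer(article, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))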

Refer to the individual model cards on the Model Hub and the original paper for more details.