GilBERTo is an Italian pretrained language model based on Facebook's RoBERTa architecture and the CamemBERT text tokenization approach.
The model was trained with the subword masking technique for 100k steps, processing ~71GB of Italian text containing 11,250,012,896 words (OSCAR: Open Super-large Crawled ALMAnaCH coRpus). We used a vocabulary of 32k BPE subwords, generated with the SentencePiece tokenizer.
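For reference, a subword vocabulary of this kind can be produced with the SentencePiece Python bindings. The snippet below is only a minimal sketch: the corpus file name and all trainer options other than the 32k vocabulary size and the BPE model type are assumptions, not the exact settings used for GilBERTo.

import sentencepiece as spm

# Train a 32k BPE vocabulary on a plain-text corpus (one sentence per line).
# "oscar_it.txt" is a hypothetical local dump of the Italian OSCAR corpus.
spm.SentencePieceTrainer.train(
    input="oscar_it.txt",
    model_prefix="gilberto_bpe",   # writes gilberto_bpe.model / gilberto_bpe.vocab
    vocab_size=32000,
    model_type="bpe",
)

# Tokenize a sentence with the resulting model.
sp = spm.SentencePieceProcessor(model_file="gilberto_bpe.model")
print(sp.encode("io sono italiano", out_type=str))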
GilBERTo was evaluated on different downstream tasks, comparing it with mBERT and other (non-BERT-based) models. More specifically, the models were compared on the following tasks:
- Part-of-Speech tagging
- Named Entity Recognition
GilBERTo is available through both the huggingface/transformers and pytorch/fairseq libraries.
Model | Library | Download |
---|---|---|
GilBERTo-uncased-from-camembert | pytorch/fairseq | GilBERTo-uncased-fairseq.v1.zip |
GilBERTo-uncased-from-camembert | huggingface/transformers | GilBERTo-uncased-transformers.v1.zip |
We are drafting the paper with all the details (coming soon).
To the best of our knowledge, downstream task applications are limited by the lack of datasets available for Italian. We strongly encourage everyone to contribute to the repository in order to improve the Italian NLP state of the art, and we will be happy to support you.
We currently selected the following tasks based on what we found in the Italian state of the art:
The PoS tagging task was evaluated using the accuracy metric on two different Italian datasets: Italian ParTUT and Italian ISDT. We also compared the results with the UDPipe and UDify models.
Model | Italian ParTUT (accuracy) | Italian ISDT (accuracy) |
---|---|---|
UDPipe | 98.4 | 98.4 |
UDify | 98.2 | 98.5 |
mBERT | 98.0 | 98.5 |
GilBERTo | 98.8 | 98.6 |
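The exact fine-tuning setup behind these numbers will be described in the paper. As a rough illustration, the sketch below shows how a token-classification head can be attached to GilBERTo with huggingface/transformers; the choice of num_labels=17 (the Universal Dependencies UPOS tag set) is an assumption, and the head is randomly initialized, so the predicted tags are meaningless until fine-tuned on a treebank.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# GilBERTo encoder plus a randomly initialized token-classification head.
tokenizer = AutoTokenizer.from_pretrained("idb-ita/gilberto-uncased-from-camembert", do_lower_case=True)
model = AutoModelForTokenClassification.from_pretrained(
    "idb-ita/gilberto-uncased-from-camembert", num_labels=17)

# One tag prediction per subword token (including the <s>/</s> special tokens).
encoding = tokenizer("io sono italiano e mi chiamo gilberto!", return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits        # shape: (1, sequence_length, num_labels)
predicted_tags = logits.argmax(dim=-1)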
The NER task was evaluated on the Italian WikiNER dataset, also used by the spaCy pretrained model for Italian, which achieves F1 score: 86.40; Precision: 86.73; Recall: 86.08.
Model | F1 | Precision | Recall |
---|---|---|---|
mBERT | 92.2 | 92.1 | 92.3 |
GilBERTo | 92.7 | 92.7 | 92.8 |
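The entity-level F1, Precision and Recall reported above can be computed with a chunk-based scorer such as the seqeval library; using seqeval here is an assumption about tooling, shown only to make the metric explicit.

from seqeval.metrics import f1_score, precision_score, recall_score

# Toy IOB-tagged example: one of the two predicted entities has a wrong boundary.
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]

print(precision_score(y_true, y_pred))  # 0.5 -> 1 of 2 predicted entities is exact
print(recall_score(y_true, y_pred))     # 0.5 -> 1 of 2 gold entities is recovered
print(f1_score(y_true, y_pred))         # 0.5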
You can use GilBERTo with the latest versions of the huggingface/transformers or pytorch/fairseq Python libraries.
import torch
from transformers import AutoModel, AutoTokenizer

# Load the GilBERTo tokenizer and model from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("idb-ita/gilberto-uncased-from-camembert", do_lower_case=True)
model = AutoModel.from_pretrained("idb-ita/gilberto-uncased-from-camembert")

# Encode a sentence into subword ids (adding a batch dimension)
input_ids = torch.tensor(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!")).unsqueeze(0)
#>> tensor([[5, 755, 181, 1413, 25, 155, 12513, 14397, 16247, 31976, 6]])

# Inspect the corresponding subword tokens
token_list = tokenizer.convert_ids_to_tokens(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!"))
#>> ['<s>', '▁io', '▁sono', '▁italiano', '▁e', '▁mi', '▁chiamo', '▁gil', 'berto', '!', '</s>']
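Continuing the snippet above, the contextual embeddings of the encoded sentence can be read from the model output; this is a minimal inference-only sketch, not part of the original example.

# Forward pass without gradients to obtain contextual embeddings.
with torch.no_grad():
    outputs = model(input_ids)
last_hidden_state = outputs.last_hidden_state  # shape: (batch_size, sequence_length, hidden_size)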
$ pip install fairseq
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

# Load GilBERTo with the pytorch/fairseq library, using the SentencePiece BPE
gilberto_model = FairseqRobertaModel.from_pretrained('path/to/checkpoints_folder',
                                                     bpe='sentencepiece')
# Mask Predictions
gilberto_model.fill_mask('Buongiorno mi <mask> Gilberto!', topk=3) #Fill mask token with GilBERTo
# Outputs
[('Buongiorno mi chiamo Gilberto!', 0.5044017434120178, ' chiamo'),
('Buongiorno mi presento Gilberto!', 0.05189879611134529, ' presento'),
('Buongiorno mi sento Gilberto!', 0.022937586531043053, ' sento')]
# Other examples
# Input: `È più facile per un italiano gesticolare senza <mask> che parlare senza gesticolare.`
# Output: `È più facile per un italiano gesticolare senza parlare che parlare senza gesticolare.`
# Input: `Agli italiani piace pasta, <mask> e mandolino`
# Output: `Agli italiani piace pasta, pizza e mandolino`
# Input: `Chi dice che il denaro non fa la <mask>, oltre a essere antipatico, è pure fesso.`
# Output: `Chi dice che il denaro non fa la felicità, oltre a essere antipatico, è pure fesso.`
# Input: `Era un uomo così antipatico che dopo la sua <mask> i parenti chiesero il bis`
# Output: `Era un uomo così antipatico che dopo la sua morte i parenti chiesero il bis`
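The same masked-word predictions can also be obtained with the huggingface/transformers fill-mask pipeline; the snippet below is a sketch assuming the hub model id shown earlier.

from transformers import pipeline

# Fill the <mask> token with the top 3 candidates predicted by GilBERTo.
fill_mask = pipeline("fill-mask", model="idb-ita/gilberto-uncased-from-camembert")
fill_mask("Buongiorno mi <mask> Gilberto!", top_k=3)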
Giulio Ravasio: Linkedin | Twitter | Github | giulio.rav@gmail.com
Leonardo Di Perna: Linkedin | Twitter | Github | dipernaleonardo@gmail.com
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing
- Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
- CamemBERT: a Tasty French Language Model
- Learning multilingual named entity recognition from Wikipedia
- 75 Languages, 1 Model: Parsing Universal Dependencies Universally