
BasqueGLUE

A Natural Language Understanding Benchmark for Basque

Euskarazko Hizkuntza Naturalaren Ulermena Ebaluatzeko Euskarria

Natural Language Understanding (NLU) technology has improved significantly over the last few years, and multitask benchmarks such as GLUE are key to evaluating this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages.

We present BasqueGLUE, the first NLU benchmark for Basque, built from pre-existing datasets following criteria similar to those used for the construction of GLUE and SuperGLUE. BasqueGLUE is freely available under an open license.

Work published at LREC 2022: BasqueGLUE: A Natural Language Understanding Benchmark for Basque

Now also available on the HuggingFace 🤗 Datasets Hub!
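
A minimal loading sketch via the `datasets` library; the Hub id `orai-nlp/basqueGLUE` and the config names (matching the task ids used by the evaluation script below, e.g. `bec`, `qnli`) are assumptions to double-check on the dataset card:

```python
# Load one BasqueGLUE task from the HuggingFace Hub.
# Dataset id and config name are assumptions; see the dataset card.
from datasets import load_dataset

bec = load_dataset("orai-nlp/basqueGLUE", "bec")  # sentiment analysis task
print(bec)              # expected splits: train / validation / test
print(bec["train"][0])  # one labeled example
```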

The 9 tasks included in BasqueGLUE:

| Dataset | Train | Val | Test | Task | Metric | Domain |
|---|---:|---:|---:|---|---|---|
| NERCid | 51,539 | 12,936 | 35,855 | NERC | F1 | News |
| NERCood | 64,475 | 14,945 | 14,462 | NERC | F1 | News, Wikipedia |
| FMTODeu_intent | 3,418 | 1,904 | 1,087 | Intent classification | F1 | Dialog system |
| FMTODeu_slot | 19,652 | 10,791 | 5,633 | Slot filling | F1 | Dialog system |
| BHTCv2 | 8,585 | 1,857 | 1,854 | Topic classification | F1 | News |
| BEC2016eu | 6,078 | 1,302 | 1,302 | Sentiment analysis | F1 | Twitter |
| VaxxStance | 864 | 206 | 312 | Stance detection | MF1* | Twitter |
| QNLIeu | 1,764 | 230 | 238 | QA/NLI | Acc | Wikipedia |
| WiCeu | 408,559 | 600 | 1,400 | WSD | Acc | WordNet |
| EpecKorrefBin | 986 | 320 | 587 | Coreference resolution | Acc | News |

NERCid stands for NERC in-domain, while NERCood stands for NERC out-of-domain. Dataset sizes for sequence labeling tasks (NERC and Slot filling) are given in tokens. Acc refers to accuracy, while F1 refers to micro-average F1-score. *The metric used for VaxxStance is macro-average F1-score of two classes: FAVOR and AGAINST.
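
As an illustration of these metric choices (this is not the official evaluation code; the `eval_basqueglue.py` script below produces the reported numbers), the scores can be computed with scikit-learn as follows:

```python
# Illustration of the benchmark's metrics with scikit-learn
# (not the official evaluation code).
from sklearn.metrics import accuracy_score, f1_score

y_true = ["NEU", "N", "P", "N"]
y_pred = ["NEU", "N", "N", "N"]
print(f1_score(y_true, y_pred, average="micro"))  # micro-average F1

# VaxxStance: macro-average F1 over the FAVOR and AGAINST classes only
stance_true = ["FAVOR", "AGAINST", "NONE", "FAVOR"]
stance_pred = ["FAVOR", "AGAINST", "FAVOR", "FAVOR"]
print(f1_score(stance_true, stance_pred,
               labels=["FAVOR", "AGAINST"], average="macro"))

print(accuracy_score(y_true, y_pred))  # accuracy: QNLIeu, WiCeu, EpecKorrefBin
```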

Note: The train file for WiCeu needs to be uncompressed.
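
For example, assuming the file ships as a gzip-compressed `train.jsonl.gz` (the exact path and compression format may differ; check the WiCeu folder):

```python
# Decompress the WiCeu train split. Path and .gz format are assumptions;
# adjust to the actual file shipped in the WiCeu dataset folder.
import gzip
import shutil

with gzip.open("wic/train.jsonl.gz", "rb") as src, \
        open("wic/train.jsonl", "wb") as dst:
    shutil.copyfileobj(src, dst)
```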

Examples of each task:

NERC
Tokens: Helburuetako bat McLareni eta Ferrariri aurre egitea izango du taldeak .
Labels: O O B-ORG O B-ORG O O O O O O
Translation: One of the team's objectives will be to take on McLaren and Ferrari.
Intent Class (FMTODeu_intent)
Text: alarma ezarri gaurko 6:00etan
Translation: set the alarm today at 6:00am
Intent: alarm/set alarm
Slot Filling (FMTODeu_slot)
Tokens: Euria egingo du gaur ?
Labels: B-weather/attribute O O B-datetime O
Translation: Is it going to rain today?
Topic Classification (BHTCv2)
Text: Gurasotasun baimena eta seme-alabak zaintzeko baimena lau hilabetera luzatzeko proposamena egitea onartu du Europako Batzordeak. Proposamenak aldaketa handia ekarriko luke Hego Euskal Herrian, lau asteetara luzatu berri baita baimen hori.
Translation: The European Commission has agreed to put forward a proposal to extend parental and childcare leave to four months. The proposal would bring a major change in the Southern Basque Country, as that leave was recently extended to four weeks.
Topic: Gizartea (society)
Sentiment Analysis (BEC)
Text: Mezu txoro, patetiko eta lotsagarri hori ongi hartuko duenik badela uste du PSEk.
Translation: PSE thinks there are people who will respond positively to that crazy, pathetic and shameful message.
Polarity: Negative
Stance Detection (VaxxStance)
Text: Gure nagusiak babestuko dituen txertoa martxan da. Zor genien. Gaur mundua apur bat hobeagoa da. #OsasunPublikoarenGaraipena #GureGaraipena
Translation: The vaccine that will protect our elderly people is on its way. We owed it to them. Today the world is a little bit better. #TheVictoryOfPublicHealthcare #OurVictory
Stance: FAVOR
QNLI
Question: “Irrintziaren oihartzunak” dokumentalaz gain, zein beste lan egin ditu zinema arloan?
Translation: Aside from the documentary “Irrintziaren oihartzunak”, what other works has she done in the field of cinema?
Sentence: “Irrintziaren oihartzunak” du lehen filma zuzendari eta gidoilari gisa.
Translation: “Irrintziaren oihartzunak” is her first film as a director and scriptwriter.
NLI: not_entailment
WiC
Sentence1: Asterix, zazpi [egunen] segida asmatu zuen galiarra .
Translation: Asterix, the Gaul who invented the 7 [days] week.
Sentence2: Etxeko landareek sasoi aktiboan tenperatura epelak behar dituzte : [egunez] 25 C ingurukoak .
Translation: House plants need warm temperatures during active season: around 25C in [daylight].
Same sense: False
Coreference (EpecKorrefBin)
Text: Birmoldaketan daudenen artean [Katalunia , Madril , Hego Euskal Herria , Aragoi , Balear irlak eta Errioxa] aurkitzen dira . [Horien] artean , Hego Euskal Herriak 47.870 milioi pezeta jasoko ditu .
Translation: Among those under reconversion are [Catalonia, Madrid, Southern Basque Country, Aragon, Balearic islands and Rioja] . Among [them], the Southern Basque Country will receive 47,870 million pesetas.
Coreference: True

For more details, each dataset is provided with its corresponding README file.

Evaluation

We provide a Python evaluation script. Fine-tuning is left to the user; the script evaluates the predictions for each task against the gold-standard test sets (test.jsonl files), and expects the predictions in the same format as the dataset files.

```shell
python3 eval_basqueglue.py \
        --task [nerc_id | nerc_od | intent | slot | bhtc | bec | vaxx | qnli | wic | coref] \
        --pred prediction_file.jsonl \
        --ref reference_file.jsonl   # usually test.jsonl
```
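
A sketch of producing a prediction file in that format; the `idx` and `label` field names below are illustrative only, so mirror the fields of the corresponding test.jsonl:

```python
# Write predictions as JSONL, one object per line.
# Field names are illustrative; copy the structure of the task's test.jsonl.
import json

predictions = [{"idx": 0, "label": "N"}, {"idx": 1, "label": "NEU"}]
with open("prediction_file.jsonl", "w", encoding="utf-8") as f:
    for example in predictions:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```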

Results from the paper:

We evaluated two language models, BERTeus and ElhBERTeu, fine-tuning them on each task independently. We used a learning rate of 3e-5 and a batch size of 32. We fine-tuned each model 5 times for up to 10 epochs, chose the checkpoint that performed best on the validation split, and report its results on the test split from a single run. The NERC results are the average of the in-domain and out-of-domain NERC scores.
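
For reference, a minimal fine-tuning sketch of that setup with HuggingFace Transformers; the Hub ids (`ixa-ehu/berteus-base-cased`, `orai-nlp/basqueGLUE`) and the `text`/`label` field names are assumptions to verify, not part of the original release:

```python
# Fine-tuning sketch for the setup above: lr 3e-5, batch size 32, up to 10
# epochs, best checkpoint selected on the validation split.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "ixa-ehu/berteus-base-cased"            # assumed Hub id for BERTeus
data = load_dataset("orai-nlp/basqueGLUE", "bec")  # sentiment analysis task

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=data["train"].features["label"].num_classes,  # assumes ClassLabel
)

data = data.map(lambda batch: tokenizer(batch["text"], truncation=True),
                batched=True)

args = TrainingArguments(
    output_dir="bec-berteus",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=10,
    eval_strategy="epoch",        # `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # keep the best validation checkpoint
)
trainer = Trainer(model=model, args=args, train_dataset=data["train"],
                  eval_dataset=data["validation"], tokenizer=tokenizer)
trainer.train()
```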

| Model | AVG | NERC (F1) | F_intent (F1) | F_slot (F1) | BHTC (F1) | BEC (F1) | Vaxx (MF1) | QNLI (acc) | WiC (acc) | coref (acc) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| BERTeus | 73.23 | 81.92 | 82.52 | 74.34 | 78.26 | 69.43 | 59.30 | 74.26 | 70.71 | 68.31 |
| ElhBERTeu | 73.71 | 82.30 | 82.24 | 75.64 | 78.05 | 69.89 | 63.81 | 73.84 | 71.71 | 65.93 |

Authors

Gorka Urbizu [1], Iñaki San Vicente [1], Xabier Saralegi [1], Rodrigo Agerri [2] and Aitor Soroa [2]

Affiliation of the authors:

[1] Orai NLP Technologies

[2] HiTZ Center - Ixa, University of the Basque Country UPV/EHU

Licensing

Each dataset of the BasqueGLUE benchmark has its own license, since most of them are, or are derived from, pre-existing datasets. See their respective README files for details.

Here is a brief summary:

| Dataset | License |
|---|---|
| NERCid | CC BY-NC-SA 4.0 |
| NERCood | CC BY-NC-SA 4.0 |
| FMTODeu_intent | CC BY-NC-SA 4.0 |
| FMTODeu_slot | CC BY-NC-SA 4.0 |
| BHTCv2 | CC BY-NC-SA 4.0 |
| BEC2016eu | Twitter's license + CC BY-NC-SA 4.0 |
| VaxxStance | Twitter's license + CC BY 4.0 |
| QNLIeu | CC BY-SA 4.0 |
| WiCeu | CC BY-NC-SA 4.0 |
| EpecKorrefBin | CC BY-NC-SA 4.0 |

For the rest of the files of the benchmark, including the evaluation script, the following license applies:

Copyright (C) by Orai NLP Technologies. This benchmark and its evaluation scripts are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/.

Acknowledgements

If you use this benchmark, please cite the following paper:

  • G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 2022, Marseille, France.
@InProceedings{urbizu2022basqueglue,
  author    = {Urbizu, Gorka  and  San Vicente, Iñaki  and  Saralegi, Xabier  and  Agerri, Rodrigo  and  Soroa, Aitor},
  title     = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1603--1612},
  abstract  = {Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.},
  url       = {https://aclanthology.org/2022.lrec-1.172}
}

Contact information

Gorka Urbizu, Iñaki San Vicente: {g.urbizu,i.sanvicente}@orai.eus