Lang, C., Wachowiak, L., Heinisch, B., & Gromann, D. Transforming Term Extraction: Transformer-Based Approaches to Multilingual Term Extraction Across Domains.
This repository contains the scripts used to fine-tune XLM-RoBERTa for the term extraction task on the ACTER dataset (https://github.com/AylaRT/ACTER) and the ACL RD-TEC 2.0 dataset (https://github.com/languagerecipes/acl-rd-tec-2.0). One model version is a token classifier that decides, for every token of an input sequence simultaneously, whether it is a term or the continuation of a term. The other model version is a sequence classifier that decides, for a given candidate term and a context in which it appears, whether the candidate is a term or not.
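As a minimal sketch of the two formulations (not the repository's own code), both model versions can be instantiated from the pretrained XLM-RoBERTa checkpoint via the `transformers` Auto classes. The label sets below are assumptions: BIO-style tags for the token classifier and a binary decision for the sequence classifier.

```python
# Minimal sketch (label sets are assumptions): the two XLM-RoBERTa formulations.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    AutoModelForSequenceClassification,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Token classifier: one label per subword token, e.g. B-Term / I-Term / O.
token_model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)
sentence = "The patient was treated with broad-spectrum antibiotics."
token_logits = token_model(**tokenizer(sentence, return_tensors="pt")).logits
# token_logits has shape (1, sequence_length, 3): one decision per token.

# Sequence classifier: a candidate term paired with its context sentence.
seq_model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)
pair = tokenizer("broad-spectrum antibiotics", sentence, return_tensors="pt")
seq_logits = seq_model(**pair).logits
# seq_logits has shape (1, 2): term vs. non-term for this candidate in context.
```

Note the practical difference: the token classifier labels a whole sentence in one forward pass, while the sequence classifier must be called once per candidate-context pair.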
The scripts were developed with the following package versions:

- transformers v.4.2.2
- torch v.1.7.0+cu101
- sentencepiece v.0.1.95
- scikit-learn v.0.24.1
- nltk v.3.2.5
- spacy v.2.2.4
- sacremoses v.0.0.43
- pandas v.1.1.5
- numpy v.1.19.5
Results (F1) on the ACTER test data for each combination of training and test language (ALL = combined training data of all three languages):

| Training | Test | Sequence Classifier | Token Classifier |
| --- | --- | --- | --- |
| EN | EN | 45.2 | 58.3 |
| FR | EN | 44.7 | 44.2 |
| NL | EN | 35.9 | 58.3 |
| ALL | EN | 46.0 | 56.2 |
| EN | FR | 48.1 | 57.6 |
| FR | FR | 46.0 | 52.9 |
| NL | FR | 40.0 | 54.5 |
| ALL | FR | 46.7 | 55.3 |
| EN | NL | 58.0 | 69.8 |
| FR | NL | 56.1 | 61.4 |
| NL | NL | 48.5 | 69.6 |
| ALL | NL | 56.0 | 67.8 |
Results (F1) on the ACL RD-TEC 2.0 data, evaluated against each annotator's annotations:

| Annotations | Token Classifier |
| --- | --- |
| Annotator 1 | 75.8 |
| Annotator 2 | 80.0 |
Training used the following two hyperparameter configurations; a sketch of how they map onto `TrainingArguments` follows below.

First configuration:

- optimizer: Adam
- learning rate: 2e-5
- batch size: 32
- epochs: 4

Second configuration:

- optimizer: Adam
- learning rate: 2e-5
- batch size: 8
- epochs: not fixed in advance; the model is evaluated every 100 steps and the best checkpoint is loaded at the end of training
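As a rough illustration only (this is not the repository's training code), the two configurations above could be expressed with the Hugging Face `Trainer` API as follows. The output paths are hypothetical, the mapping of configurations to the two model versions is not stated above, and `Trainer` defaults to AdamW, a variant of the Adam optimizer listed above.

```python
# Hedged sketch: TrainingArguments mirroring the two hyperparameter lists above.
# Output directories are placeholders, not paths from the repository.
from transformers import TrainingArguments

# First configuration: fixed number of epochs.
config_a = TrainingArguments(
    output_dir="model_a",             # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=4,
)

# Second configuration: evaluate every 100 steps and keep the best checkpoint.
config_b = TrainingArguments(
    output_dir="model_b",             # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    evaluation_strategy="steps",      # evaluate periodically during training
    eval_steps=100,
    save_steps=100,                   # checkpoint at the same cadence as evaluation
    load_best_model_at_end=True,
)
```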