In this work, we propose XAlign, a cross-lingual fact-to-text dataset, accepted at the WebConf 2022 poster/demo track. It consists of English Wikidata triples (facts) mapped to sentences from low-resource-language Wikipedias.
We explored two different unsupervised methods for the cross-lingual alignment task, based on:
- Transfer learning from NLI (a rough sketch of this idea is given below)
- Distant supervision from another English-only dataset
This repository provides steps for executing the cross-lingual alignment approaches and for finetuning mT5 for data-to-text generation on XAlign. More details, analyses, and baseline results can be found in our paper.
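As a rough illustration of the NLI-based alignment idea (not the exact pipeline used in this repository), one can treat the Wikipedia sentence as the premise and a verbalized fact as the hypothesis, and use the entailment probability from an XNLI-finetuned model as the alignment score. In the sketch below, the checkpoint name and the naive fact verbalization are illustrative assumptions.

```python
# Hedged sketch: score how well a sentence supports a fact using an XNLI-finetuned model.
# The checkpoint and the simple fact verbalization are illustrative, not the repo's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "joeddav/xlm-roberta-large-xnli"  # any multilingual NLI checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def alignment_score(sentence: str, fact: dict) -> float:
    # Verbalize the (subject, predicate, object) triple into a hypothesis string.
    hypothesis = f"{fact['subject']} {fact['predicate']} {fact['object']}."
    inputs = tokenizer(sentence, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Look up the entailment index from the model config instead of hard-coding it.
    entail_idx = next(i for i, lbl in model.config.id2label.items()
                      if lbl.lower().startswith("entail"))
    return probs[entail_idx].item()

sentence = "Mark Paul Briers (born 21 April 1968) is a former English cricketer."
fact = {"subject": "Mark Briers", "predicate": "date of birth", "object": "21 April 1968"}
print(alignment_score(sentence, fact))
```

The scripts described later in this README finetune stronger multilingual models (mT5, MuRIL, XLM-RoBERTa) for the same alignment task.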
Install the required packages as follows:
pip install -r requirements.txt
- v2.0 (Sep 2022): Extended to four additional languages: Punjabi (pa), Malayalam (ml), Assamese (as), and Oriya (or).
- v1.0 (Apr 2022): Introduced the cross-lingual data-to-text dataset in 8 languages: Hindi (hi), Marathi (mr), Gujarati (gu), Telugu (te), Tamil (ta), Kannada (kn), Bengali (bn), and a monolingual dataset in English (en).
Each record consists of the following entries:
- sentence (string) : Native-language Wikipedia sentence (non-native-language strings were removed).
- facts (List[Dict]) : List of facts associated with the sentence, where each fact is stored as a dictionary.
- language (string) : Language identifier.

Each fact in the facts list is a dictionary with the following entries:
- subject (string) : The central entity.
- object (string) : An entity or a piece of information about the subject.
- predicate (string) : The relationship that connects the subject and the object.
- qualifiers (List[Dict]) : Additional information about the fact, stored as a list of qualifiers, where each qualifier is a dictionary with two keys: qualifier_predicate, the property of the qualifier, and qualifier_object, the value for that qualifier's predicate.
Example from English:
{
"sentence": "Mark Paul Briers (born 21 April 1968) is a former English cricketer.",
"facts": [
{
"subject": "Mark Briers",
"predicate": "date of birth",
"object": "21 April 1968",
"qualifiers": []
},
{
"subject": "Mark Briers",
"predicate": "occupation",
"object": "cricketer",
"qualifiers": []
},
{
"subject": "Mark Briers",
"predicate": "country of citizenship",
"object": "United Kingdom",
"qualifiers": []
}
],
"language": "en"
}
Example from one of the low-resource languages (i.e., Hindi):
{
"sentence": "बोरिस पास्तेरनाक १९५८ में साहित्य के क्षेत्र में नोबेल पुरस्कार विजेता रहे हैं।",
"facts": [
{
"subject": "Boris Pasternak",
"predicate": "nominated for",
"object": "Nobel Prize in Literature",
"qualifiers": [
{
"qualifier_predicate": "point in time",
"qualifier_object": "1958"
}
]
}
],
"language": "hi"
}
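For reference, here is a minimal sketch of how such records could be loaded and inspected, assuming each split is distributed as a JSON Lines file (the file name below is a placeholder, not necessarily the actual release layout).

```python
# Hedged sketch: iterate over XAlign-style records, assuming one JSON object per line.
# "hi_test.jsonl" is a placeholder file name.
import json

with open("hi_test.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["language"], record["sentence"])
        for fact in record["facts"]:
            print("  ", fact["subject"], "|", fact["predicate"], "|", fact["object"])
            for q in fact["qualifiers"]:
                print("    ", q["qualifier_predicate"], "=", q["qualifier_object"])
```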
We manually annotated the test dataset for each language with the help of crowd-sourced annotators.
Language | #Count | #Word count (avg/min/max) | #Facts/sentence (avg/min/max) |
---|---|---|---|
Hindi | 842 | 11.1/5/24 | 2.1/1/5 |
Marathi | 736 | 12.7/6/40 | 2.1/1/8 |
Telugu | 734 | 9.7/5/30 | 2.2/1/6 |
Tamil | 656 | 9.5/5/24 | 1.9/1/8 |
English | 470 | 17.5/8/61 | 2.7/1/7 |
Gujarati | 530 | 12.7/6/31 | 2.1/1/6 |
Bengali | 792 | 8.7/5/24 | 1.6/1/5 |
Kannada | 642 | 10.4/6/45 | 2.2/1/7 |
Oriya | 529 | 13.4/5/45 | 2.4/1/7 |
Assamese | 637 | 16.22/5/72 | 2.2/1/9 |
Malayalam | 615 | 9.2/6/24 | 1.8/1/5 |
Punjabi | 529 | 13.4/5/45 | 2.4/1/7 |
We automatically created a large collection of well-aligned sentence-fact pairs across languages using the best cross-lingual aligner, as evaluated on the gold-standard test datasets.
Language | #Count | #Word count (avg/min/max) | #Facts/sentence (avg/min/max) |
---|---|---|---|
Hindi | 56582 | 25.3/5/99 | 2.0/1/10 |
Marathi | 19408 | 20.4/5/94 | 2.2/1/10 |
Telugu | 24344 | 15.6/5/97 | 1.7/1/10 |
Tamil | 56707 | 16.7/5/97 | 1.8/1/10 |
English | 132584 | 20.2/4/86 | 2.2/1/10 |
Gujarati | 9031 | 23.4/5/99 | 1.8/1/10 |
Bengali | 121216 | 19.3/5/99 | 2.0/1/10 |
Kannada | 25441 | 19.3/5/99 | 1.9/1/10 |
Oriya | 14333 | 16.88/5/99 | 1.7/1/10 |
Assamese | 9707 | 19.23/5/99 | 1.6/1/10 |
Malayalam | 55135 | 15.7/5/98 | 1.9/1/10 |
Punjabi | 30136 | 32.1/5/99 | 2.1/1/10 |
Before executing the code, download the XNLI dataset from here.
To execute the mT5-based approach, follow these steps:
$ cd XNLI-based-models/finetune_mt5
Copy xnli_dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 16 --max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-large --fp16 0
To execute the MuRIL- or XLM-RoBERTa-based approaches, follow these steps:
$ cd XNLI-based-models/finetune_multilingual_encoder_models
Copy xnli_dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate <lr> --model_name <model_name> --fp16 1
where:
- model_name can be 'google/muril-large-cased' or 'xlm-roberta-large'
- learning_rate must be '1e-5' for 'xlm-roberta-large' or '2e-5' for 'google/muril-large-cased'
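For example, to finetune XLM-RoBERTa-large with the learning rate listed above:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate 1e-5 --model_name xlm-roberta-large --fp16 1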
Before executing the code, download the multilingual KELM dataset from here.
To execute the mT5-based approach, follow these steps:
$ cd distant_supervision/finetune_mt5
Copy multilingual-KELM-dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 16 --max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-large --fp16 0
To execute the MuRIL- or XLM-RoBERTa-based approaches, follow these steps:
$ cd distant_supervision/finetune_multilingual_encoder_models
Copy multilingual-KELM-dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate <lr> --model_name <model_name> --fp16 1
where:
- model_name can be 'google/muril-large-cased' or 'xlm-roberta-large'
- learning_rate must be '1e-5' for 'xlm-roberta-large' or '2e-5' for 'google/muril-large-cased'
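For example, to finetune MuRIL-large with the learning rate listed above:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate 2e-5 --model_name google/muril-large-cased --fp16 1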
Following are the F1-scores for cross-lingual alignment on the gold-standard test datasets.
Model | Hindi | Marathi | Telugu | Tamil | English | Gujarati | Bengali | Kannada | Average |
---|---|---|---|---|---|---|---|---|---|
Baselines | | | | | | | | | |
KELM-style | 49.3 | 42.6 | 36.8 | 45.1 | 41.0 | 37.2 | 43.6 | 33.8 | 41.1 |
WITA-style | 50.7 | 57.4 | 51.7 | 45.9 | 60.2 | 50.0 | 53.5 | 53.0 | 52.8 |
Stage-1 + TF-IDF | 75.0 | 68.5 | 69.3 | 71.8 | 73.7 | 70.1 | 78.7 | 64.7 | 71.5 |
Distant supervision based approaches | | | | | | | | | |
MuRIL-large | 76.3 | 68.4 | 74.0 | 75.5 | 70.5 | 78.5 | 62.4 | 67.7 | 71.7 |
XLM-RoBERTa-large | 78.1 | 69.0 | 76.5 | 73.9 | 76.5 | 78.5 | 66.9 | 72.4 | 74.0 |
mT5-large | 79.0 | 71.4 | 77.6 | 78.6 | 76.6 | 80.0 | 69.8 | 70.5 | 75.4 |
Transfer learning based approaches | | | | | | | | | |
MuRIL-large | 71.6 | 71.7 | 76.5 | 75.1 | 73.4 | 78.7 | 79.5 | 71.8 | 74.8 |
XLM-RoBERTa-large | 77.2 | 76.7 | 78.0 | 81.2 | 79.0 | 80.5 | 83.1 | 72.7 | 78.6 |
mT5-large | 90.2 | 83.1 | 84.1 | 88.6 | 84.5 | 85.1 | 75.1 | 78.5 | 83.7 |
Before proceeding, copy XAlign-dataset.zip (available upon request) to the data-to-text-generator/mT5-baseline/datasets folder and unzip it.
To finetune the best baseline on XAlign, follow these steps:
$ cd data-to-text-generator/mT5-baseline
$ python main.py --epochs 30 --gpus 1 --batch_size 2 --src_max_seq_len 250 --tgt_max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-small --online_mode 0 --use_pretrained 1 --lang hi,mr,te,ta,en,gu,bn,kn --verbose --enable_script_unification 1
To evaluate the trained model, follow these steps:
$ cd data-to-text-generator/mT5-baseline
$ python main.py --epochs 30 --gpus 1 --batch_size 4 --src_max_seq_len 250 --tgt_max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-small --online_mode 0 --use_pretrained 1 --lang hi,mr,te,ta,en,gu,bn,kn --enable_script_unification 1 --inference
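Outside the provided scripts, a finetuned checkpoint can in principle be used directly with Hugging Face transformers. The sketch below is only illustrative: the checkpoint path and the fact linearization format are assumptions, and the repository's main.py handles the actual preprocessing.

```python
# Hedged sketch: generate a sentence from linearized facts with a finetuned mT5 checkpoint.
# "path/to/finetuned-mt5" and the "<S> ... <P> ... <O> ..." linearization are assumptions,
# not the exact format used by the repository's preprocessing code.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-mt5")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/finetuned-mt5")

facts = [
    ("Mark Briers", "date of birth", "21 April 1968"),
    ("Mark Briers", "occupation", "cricketer"),
]
source = " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in facts)

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=250)
output_ids = model.generate(**inputs, max_length=200, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```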
BLEU scores obtained on the XAlign test set.
Model | Hindi | Marathi | Telugu | Tamil | English | Gujarati | Bengali | Kannada | Average |
---|---|---|---|---|---|---|---|---|---|
Baseline (fact translation) | 2.71 | 2.04 | 0.95 | 1.68 | 1.01 | 0.64 | 2.73 | 0.45 | 1.53 |
GAT-Transformer | 29.54 | 17.94 | 4.91 | 7.19 | 40.33 | 11.34 | 30.15 | 5.08 | 18.31 |
Vanilla Transformer | 35.42 | 17.31 | 6.94 | 8.82 | 38.87 | 13.21 | 35.61 | 3.16 | 19.92 |
mT5-small | 40.61 | 20.23 | 11.39 | 13.61 | 43.65 | 16.61 | 45.28 | 8.77 | 25.02 |
- Tushar Abhishek
- Shivprasad Sagare
- Bhavyajeet Singh
- Anubhav Sharma
- Manish Gupta
- Vasudeva Varma
One can cite our paper as follows:
@article{abhishek2022xalign,
title={XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages},
author={Abhishek, Tushar and Sagare, Shivprasad and Singh, Bhavyajeet and Sharma, Anubhav and Gupta, Manish and Varma, Vasudeva},
journal={arXiv preprint arXiv:2202.00291},
year={2022}
}