In this work, we propose XAlign, a cross-lingual fact-to-text dataset, accepted at the WebConf 2022 poster/demo track. It consists of English Wikidata triples (facts) mapped to sentences from low-resource-language Wikipedias.
We explored two different unsupervised methods for the cross-lingual alignment task, based on:
- Transfer learning from NLI (a rough sketch of this idea is given below)
- Distant supervision from another English-only dataset
This repository provides steps for executing the cross-lingual alignment approaches and for finetuning mT5 for data-to-text generation on XAlign. More details, analyses, and baseline results can be found in our paper.
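As a rough illustration of the NLI-based alignment idea (not the exact pipeline used in this repository), one can treat the Wikipedia sentence as the premise and a verbalized fact as the hypothesis, and use the entailment probability from an XNLI-finetuned model as the alignment score. In the sketch below, the checkpoint name and the naive fact verbalization are illustrative assumptions.

```python
# Hedged sketch: score how well a sentence supports a fact using an XNLI-finetuned model.
# The checkpoint and the simple fact verbalization are illustrative, not the repo's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "joeddav/xlm-roberta-large-xnli"  # any multilingual NLI checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def alignment_score(sentence: str, fact: dict) -> float:
    # Verbalize the (subject, predicate, object) triple into a hypothesis string.
    hypothesis = f"{fact['subject']} {fact['predicate']} {fact['object']}."
    inputs = tokenizer(sentence, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Look up the entailment index from the model config instead of hard-coding it.
    entail_idx = next(i for i, lbl in model.config.id2label.items()
                      if lbl.lower().startswith("entail"))
    return probs[entail_idx].item()

sentence = "Mark Paul Briers (born 21 April 1968) is a former English cricketer."
fact = {"subject": "Mark Briers", "predicate": "date of birth", "object": "21 April 1968"}
print(alignment_score(sentence, fact))
```

The scripts described later in this README finetune stronger multilingual models (mT5, MuRIL, XLM-RoBERTa) for the same alignment task.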
Install the required packages as follows:
pip install -r requirements.txt
- v2.0 (Sep 2022): Extended to four additional languages: Punjabi (pa), Malayalam (ml), Assamese (as), and Oriya (or).
- v1.0 (Apr 2022): Introduced the cross-lingual data-to-text dataset in 8 languages: Hindi (hi), Marathi (mr), Gujarati (gu), Telugu (te), Tamil (ta), Kannada (kn), Bengali (bn), and a monolingual dataset in English (en).
Each record consists of the following entries:
- sentence (string) : Native-language Wikipedia sentence (non-native-language strings were removed).
- facts (List[Dict]) : List of facts associated with the sentence, where each fact is stored as a dictionary.
- language (string) : Language identifier.

Each fact in the facts list is a dictionary with the following entries:
- subject (string) : The central entity.
- object (string) : An entity or a piece of information about the subject.
- predicate (string) : The relationship that connects the subject and the object.
- qualifiers (List[Dict]) : Additional information about the fact, stored as a list of qualifiers, where each qualifier is a dictionary with two keys: qualifier_predicate, the property of the qualifier, and qualifier_object, the value for that qualifier's predicate.
Example from English:
{
"sentence": "Mark Paul Briers (born 21 April 1968) is a former English cricketer.",
"facts": [
{
"subject": "Mark Briers",
"predicate": "date of birth",
"object": "21 April 1968",
"qualifiers": []
},
{
"subject": "Mark Briers",
"predicate": "occupation",
"object": "cricketer",
"qualifiers": []
},
{
"subject": "Mark Briers",
"predicate": "country of citizenship",
"object": "United Kingdom",
"qualifiers": []
}
],
"language": "en"
}
Example from one of the low-resource languages (i.e., Hindi):
{
"sentence": "बोरिस पास्तेरनाक १९५८ में साहित्य के क्षेत्र में नोबेल पुरस्कार विजेता रहे हैं।",
"facts": [
{
"subject": "Boris Pasternak",
"predicate": "nominated for",
"object": "Nobel Prize in Literature",
"qualifiers": [
{
"qualifier_predicate": "point in time",
"qualifier_object": "1958"
}
]
}
],
"language": "hi"
}
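For reference, here is a minimal sketch of how such records could be loaded and inspected, assuming each split is distributed as a JSON Lines file (the file name below is a placeholder, not necessarily the actual release layout).

```python
# Hedged sketch: iterate over XAlign-style records, assuming one JSON object per line.
# "hi_test.jsonl" is a placeholder file name.
import json

with open("hi_test.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["language"], record["sentence"])
        for fact in record["facts"]:
            print("  ", fact["subject"], "|", fact["predicate"], "|", fact["object"])
            for q in fact["qualifiers"]:
                print("    ", q["qualifier_predicate"], "=", q["qualifier_object"])
```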
We manually annotated the test dataset for each language with the help of crowd-sourced annotators.
Language | #Count | #Word count (avg/min/max) | #Facts/sentence (avg/min/max) |
---|---|---|---|
Hindi | 842 | 11.1/5/24 | 2.1/1/5 |
Marathi | 736 | 12.7/6/40 | 2.1/1/8 |
Telugu | 734 | 9.7/5/30 | 2.2/1/6 |
Tamil | 656 | 9.5/5/24 | 1.9/1/8 |
English | 470 | 17.5/8/61 | 2.7/1/7 |
Gujarati | 530 | 12.7/6/31 | 2.1/1/6 |
Bengali | 792 | 8.7/5/24 | 1.6/1/5 |
Kannada | 642 | 10.4/6/45 | 2.2/1/7 |
Oriya | 529 | 13.4/5/45 | 2.4/1/7 |
Assamese | 637 | 16.22/5/72 | 2.2/1/9 |
Malayalam | 615 | 9.2/6/24 | 1.8/1/5 |
Punjabi | 529 | 13.4/5/45 | 2.4/1/7 |
We automatically created a large collection of well-aligned sentence-fact pairs across languages using the best cross-lingual aligner, as evaluated on the gold-standard test datasets.
Language | #Count | #Word count (avg/min/max) | #Facts/sentence (avg/min/max) |
---|---|---|---|
Hindi | 56582 | 25.3/5/99 | 2.0/1/10 |
Marathi | 19408 | 20.4/5/94 | 2.2/1/10 |
Telugu | 24344 | 15.6/5/97 | 1.7/1/10 |
Tamil | 56707 | 16.7/5/97 | 1.8/1/10 |
English | 132584 | 20.2/4/86 | 2.2/1/10 |
Gujarati | 9031 | 23.4/5/99 | 1.8/1/10 |
Bengali | 121216 | 19.3/5/99 | 2.0/1/10 |
Kannada | 25441 | 19.3/5/99 | 1.9/1/10 |
Oriya | 14333 | 16.88/5/99 | 1.7/1/10 |
Assamese | 9707 | 19.23/5/99 | 1.6/1/10 |
Malayalam | 55135 | 15.7/5/98 | 1.9/1/10 |
Punjabi | 30136 | 32.1/5/99 | 2.1/1/10 |
Before executing the code, download the XNLI dataset from here.
To execute the mT5-based approach, follow these steps:
$ cd XNLI-based-models/finetune_mt5
Copy xnli_dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 16 --max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-large --fp16 0
To execute the MuRIL- or XLM-RoBERTa-based approaches, follow these steps:
$ cd XNLI-based-models/finetune_multilingual_encoder_models
Copy xnli_dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate <lr> --model_name <model_name> --fp16 1
where:
- model_name can be 'google/muril-large-cased' or 'xlm-roberta-large'
- learning_rate must be '1e-5' for 'xlm-roberta-large' or '2e-5' for 'google/muril-large-cased'
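For example, to finetune XLM-RoBERTa-large with the learning rate listed above:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate 1e-5 --model_name xlm-roberta-large --fp16 1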
Before executing the code, download the multilingual KELM dataset from here.
To execute the mT5-based approach, follow these steps:
$ cd distant_supervision/finetune_mt5
Copy multilingual-KELM-dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 16 --max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-large --fp16 0
To execute the MuRIL- or XLM-RoBERTa-based approaches, follow these steps:
$ cd distant_supervision/finetune_multilingual_encoder_models
Copy multilingual-KELM-dataset.zip (downloaded before) to the ./datasets directory and unzip it. Finally, execute the command:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate <lr> --model_name <model_name> --fp16 1
where:
- model_name can be 'google/muril-large-cased' or 'xlm-roberta-large'
- learning_rate must be '1e-5' for 'xlm-roberta-large' or '2e-5' for 'google/muril-large-cased'
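For example, to finetune MuRIL-large with the learning rate listed above:
$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate 2e-5 --model_name google/muril-large-cased --fp16 1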
Following are the F1-scores for cross-lingual alignment on the gold-standard test datasets.
Model | Hindi | Marathi | Telugu | Tamil | English | Gujarati | Bengali | Kannada | Average |
---|---|---|---|---|---|---|---|---|---|
Baselines | | | | | | | | | |
KELM-style | 49.3 | 42.6 | 36.8 | 45.1 | 41.0 | 37.2 | 43.6 | 33.8 | 41.1 |
WITA-style | 50.7 | 57.4 | 51.7 | 45.9 | 60.2 | 50.0 | 53.5 | 53.0 | 52.8 |
Stage-1 + TF-IDF | 75.0 | 68.5 | 69.3 | 71.8 | 73.7 | 70.1 | 78.7 | 64.7 | 71.5 |
Distant supervision based approaches | | | | | | | | | |
MuRIL-large | 76.3 | 68.4 | 74.0 | 75.5 | 70.5 | 78.5 | 62.4 | 67.7 | 71.7 |
XLM-RoBERTa-large | 78.1 | 69.0 | 76.5 | 73.9 | 76.5 | 78.5 | 66.9 | 72.4 | 74.0 |
mT5-large | 79.0 | 71.4 | 77.6 | 78.6 | 76.6 | 80.0 | 69.8 | 70.5 | 75.4 |
Transfer learning based approaches | | | | | | | | | |
MuRIL-large | 71.6 | 71.7 | 76.5 | 75.1 | 73.4 | 78.7 | 79.5 | 71.8 | 74.8 |
XLM-RoBERTa-large | 77.2 | 76.7 | 78.0 | 81.2 | 79.0 | 80.5 | 83.1 | 72.7 | 78.6 |
mT5-large | 90.2 | 83.1 | 84.1 | 88.6 | 84.5 | 85.1 | 75.1 | 78.5 | 83.7 |
Before proceeding, copy XAlign-dataset.zip (available upon request) to the data-to-text-generator/mT5-baseline/datasets folder and unzip it.
To finetune the best baseline on XAlign, follow these steps:
$ cd data-to-text-generator/mT5-baseline
$ python main.py --epochs 30 --gpus 1 --batch_size 2 --src_max_seq_len 250 --tgt_max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-small --online_mode 0 --use_pretrained 1 --lang hi,mr,te,ta,en,gu,bn,kn --verbose --enable_script_unification 1
To evaluate the trained model, follow these steps:
$ cd data-to-text-generator/mT5-baseline
$ python main.py --epochs 30 --gpus 1 --batch_size 4 --src_max_seq_len 250 --tgt_max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-small --online_mode 0 --use_pretrained 1 --lang hi,mr,te,ta,en,gu,bn,kn --enable_script_unification 1 --inference
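Outside the provided scripts, a finetuned checkpoint can in principle be used directly with Hugging Face transformers. The sketch below is only illustrative: the checkpoint path and the fact linearization format are assumptions, and the repository's main.py handles the actual preprocessing.

```python
# Hedged sketch: generate a sentence from linearized facts with a finetuned mT5 checkpoint.
# "path/to/finetuned-mt5" and the "<S> ... <P> ... <O> ..." linearization are assumptions,
# not the exact format used by the repository's preprocessing code.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-mt5")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/finetuned-mt5")

facts = [
    ("Mark Briers", "date of birth", "21 April 1968"),
    ("Mark Briers", "occupation", "cricketer"),
]
source = " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in facts)

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=250)
output_ids = model.generate(**inputs, max_length=200, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```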
BLEU scores obtained on the XAlign test set.
Model | Hindi | Marathi | Telugu | Tamil | English | Gujarati | Bengali | Kannada | Average |
---|---|---|---|---|---|---|---|---|---|
Baseline (fact translation) | 2.71 | 2.04 | 0.95 | 1.68 | 1.01 | 0.64 | 2.73 | 0.45 | 1.53 |
GAT-Transformer | 29.54 | 17.94 | 4.91 | 7.19 | 40.33 | 11.34 | 30.15 | 5.08 | 18.31 |
Vanilla Transformer | 35.42 | 17.31 | 6.94 | 8.82 | 38.87 | 13.21 | 35.61 | 3.16 | 19.92 |
mT5-small | 40.61 | 20.23 | 11.39 | 13.61 | 43.65 | 16.61 | 45.28 | 8.77 | 25.02 |
- Tushar Abhishek
- Shivprasad Sagare
- Bhavyajeet Singh
- Anubhav Sharma
- Manish Gupta
- Vasudeva Varma
One can cite our paper as follows:
@article{abhishek2022xalign,
title={XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages},
author={Abhishek, Tushar and Sagare, Shivprasad and Singh, Bhavyajeet and Sharma, Anubhav and Gupta, Manish and Varma, Vasudeva},
journal={arXiv preprint arXiv:2202.00291},
year={2022}
}