Implementation of the translation pipeline, automatic sampling and scoring, human evaluation, and experiments from our EMNLP 2022 paper: IndicXNLI: Evaluating Multilingual Inference for Indian Languages. To explore the dataset online, visit the dataset page.
```bibtex
@inproceedings{aggarwal-etal-2022-indicxnli,
    title = "{I}ndic{XNLI}: Evaluating Multilingual Inference for {I}ndian Languages",
    author = "Aggarwal, Divyanshu and
      Gupta, Vivek and
      Kunchukuttan, Anoop",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.755",
    pages = "10994--11006",
    abstract = "While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets on standard NLU tasks are limited. To this end, we introduce INDICXNLI, an NLI dataset for 11 Indic languages. It has been created by high-quality machine translation of the original English XNLI dataset and our analysis attests to the quality of INDICXNLI. By finetuning different pre-trained LMs on this INDICXNLI, we analyze various cross-lingual transfer techniques with respect to the impact of the choice of language models, languages, multi-linguality, mix-language input, etc. These experiments provide us with useful insights into the behaviour of pre-trained models for a diverse set of languages.",
}
```
Below are the details about the IndicXNLI dataset and the scripts for reproducing the results reported in the EMNLP 2022 paper.

The code requires Python 3.7+.
Clone this repository on your machine:

```bash
git clone https://github.com/divyanshuaggarwal/IndicXNLI.git
```

Create a conda environment and activate it:

```bash
conda create --name indicxnli python=3.7
conda activate indicxnli
```

Install the requirements:

```bash
pip install -r requirements.txt
```
Download and unpack the IndicXNLI dataset into `./data` in the main IndicXNLI folder. Carefully read the LICENCE for non-academic usage.
Run the following commands from the root of the repository:

```bash
cd data
git lfs install
git clone https://huggingface.co/datasets/Divyanshu/indicxnli
```
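Since the dataset is hosted on the Hugging Face Hub, it may also be loadable directly with the `datasets` library. A minimal sketch, assuming per-language configurations named by language code (e.g. `hi` for Hindi):

```python
from datasets import load_dataset

# Assumption: the Hub dataset exposes one configuration per language code.
indicxnli_hi = load_dataset("Divyanshu/indicxnli", "hi")

print(indicxnli_hi)             # available splits and their sizes
print(indicxnli_hi["test"][0])  # one premise/hypothesis/label example
```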
After downloading, you'll have multiple sub-folders, each containing several JSON files. Each JSON file is a list of JSON objects with `premise`, `hypothesis`, and `label` keys. The folder structure is as follows:
```
data
├── backward
│   ├── dev
│   │   ├── xnli_as.json
│   │   ├── xnli_bn.json
│   │   ├── xnli_gu.json
│   │   ├── xnli_hi.json
│   │   ├── xnli_kn.json
│   │   ├── xnli_ml.json
│   │   ├── xnli_mr.json
│   │   ├── xnli_or.json
│   │   ├── xnli_pa.json
│   │   ├── xnli_ta.json
│   │   └── xnli_te.json
│   └── test
│       ├── xnli_as.json
│       ├── xnli_bn.json
│       ├── xnli_gu.json
│       ├── xnli_hi.json
│       ├── xnli_kn.json
│       ├── xnli_ml.json
│       ├── xnli_mr.json
│       ├── xnli_or.json
│       ├── xnli_pa.json
│       ├── xnli_ta.json
│       └── xnli_te.json
└── forward
    ├── dev
    │   ├── xnli_as.json
    │   ├── xnli_bn.json
    │   ├── xnli_gu.json
    │   ├── xnli_hi.json
    │   ├── xnli_kn.json
    │   ├── xnli_ml.json
    │   ├── xnli_mr.json
    │   ├── xnli_or.json
    │   ├── xnli_pa.json
    │   ├── xnli_ta.json
    │   └── xnli_te.json
    ├── test
    │   ├── xnli_as.json
    │   ├── xnli_bn.json
    │   ├── xnli_gu.json
    │   ├── xnli_hi.json
    │   ├── xnli_kn.json
    │   ├── xnli_ml.json
    │   ├── xnli_mr.json
    │   ├── xnli_or.json
    │   ├── xnli_pa.json
    │   ├── xnli_ta.json
    │   └── xnli_te.json
    └── train
        ├── xnli_as.json
        ├── xnli_bn.json
        ├── xnli_gu.json
        ├── xnli_hi.json
        ├── xnli_kn.json
        ├── xnli_ml.json
        ├── xnli_mr.json
        ├── xnli_or.json
        ├── xnli_pa.json
        ├── xnli_ta.json
        └── xnli_te.json

7 directories, 55 files
```
`data/forward` and `data/backward` are the primary dataset folders to work with here. The `data/forward` folder contains `en-->in` translations of the premises and hypotheses, and the `data/backward` folder contains `in-->en` back-translations of the `data/forward` folder.
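To inspect an example, you can load one of these files directly; a minimal sketch, assuming the list-of-objects format described above:

```python
import json

# Load the Hindi forward-translated dev split.
with open("data/forward/dev/xnli_hi.json") as f:
    examples = json.load(f)

print(len(examples))
first = examples[0]
print(first["premise"])
print(first["hypothesis"])
print(first["label"])
```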
The original English XNLI dataset is translated using the IndicTrans translation model from AI4Bharat. To install IndicTrans, run the following in the terminal:

```bash
bash setup_indic_trans.sh
```
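For reference, translating with IndicTrans typically looks like the sketch below, following the usage examples in the IndicTrans repository; the checkpoint directory is a placeholder that must point at a downloaded en-indic model, and the script must run from inside the cloned indicTrans directory:

```python
from inference.engine import Model  # provided by the indicTrans repo

# Placeholder path to the downloaded en-indic checkpoint directory.
model = Model(expdir="../en-indic")

sentences = ["The cat sat on the mat."]
# batch_translate(sentences, source_language, target_language)
print(model.batch_translate(sentences, "en", "hi"))
```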
You can then create the dataset by running:

```bash
python src/make_dataset.py
```

This will create the dataset in the `data` folder in the tree format shown above.

Note: Before running any Python script, first run `export PYTHONPATH=$PWD` in the terminal.
To sample examples from the test dataset using DPP, run the following in the terminal:

```bash
python src/sampling.py
```

This will create the necessary files in the `./sampled_data` directory:
```
sampled_data
├── samples_as.csv
├── samples_gu.csv
├── samples_or.csv
├── samples_kn.csv
├── samples_bn.csv
├── samples_hi.csv
├── samples_pa.csv
├── samples_mr.csv
├── samples_ml.csv
├── samples_ta.csv
└── samples_te.csv
```
These files are then copied to `./human_evaluation_data/` for human annotators to evaluate the translation quality.
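A quick way to inspect one of the sampled files; a sketch that only assumes the files are standard CSVs (the exact column names are not guaranteed):

```python
import pandas as pd

# Inspect the Hindi sample file.
df = pd.read_csv("sampled_data/samples_hi.csv")
print(df.shape)
print(df.columns.tolist())
print(df.head())
```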
We calculate BERTScore between the original test data and the back-translated test data, which lets us estimate an upper bound on the translation error. To compute the scores, run:

```bash
python src/generate_automatic_scores.py
```

This will populate the `./automatic_scores/` folder with files named `test_set_{language code}_avg.json`, as shown below:
```
automatic_scores
├── test_set_as_avg.json
├── test_set_gu_avg.json
├── test_set_kn_avg.json
├── test_set_pa_avg.json
├── test_set_mr_avg.json
├── test_set_ml_avg.json
├── test_set_hi_avg.json
├── test_set_bn_avg.json
├── test_set_ta_avg.json
├── test_set_te_avg.json
└── test_set_or_avg.json
```
In each JSON file, the key identifies the comparison (original vs. the IndicTrans or Google Translate back-translation) and the value is the average BERTScore across all test samples. For example:
```json
{
    "bertscore f1 original vs back translated": 0.9366865618024282,
    "bertscore f1 original vs back translated googletrans": 0.9743754622941008,
    "bertscore precision original vs back translated": 0.931326547413767,
    "bertscore precision original vs back translated googletrans": 0.9859856570433237,
    "bertscore recall original vs back translated": 0.9423049956560134,
    "bertscore recall original vs back translated googletrans": 0.9659530529480971
}
```
The script will also generate a sentence-wise BERTScore `.csv` file for every language in the `./automatic_scores` folder.
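For reference, averages like those above can be computed with the `bert-score` package; a minimal sketch, assuming the original sentences and their back-translations are available as parallel lists:

```python
from bert_score import score

# Hypothetical parallel lists: original English sentences and their
# in-->en back-translations.
originals = ["A man is playing a guitar."]
back_translated = ["A man plays the guitar."]

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(back_translated, originals, lang="en")
print("precision:", P.mean().item())
print("recall:", R.mean().item())
print("f1:", F1.mean().item())
```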
The experiments were run on Google Colab TPUs. To set up the TPU environment, run the following in the terminal:

```bash
bash setup_tpu.sh
```

Also uncomment line 17 in the `src/trainer.py` file.
To run the baselines, type the following in the terminal:

```bash
python main.py
```

This will create the result JSON files in the `./results/` folder. A typical result file is named `xnli_experiments_{model name}_{baseline suffix code}.json`. In each file, the data is organised as key-value pairs, where the key is a language code and the value is the accuracy for that language. For example:
```
{
    "as": 0.72,
    "gu": 0.85,
    "kn": 0.75,
    ....
}
```
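A small sketch for summarising one such result file; the filename below is hypothetical and should be replaced with the model name and suffix code of the run you want to inspect:

```python
import json
from statistics import mean

# Hypothetical filename: substitute your model name and baseline suffix code.
with open("results/xnli_experiments_xlm-roberta-base_in.json") as f:
    accuracies = json.load(f)

for lang, acc in sorted(accuracies.items()):
    print(f"{lang}: {acc:.2f}")
print(f"average: {mean(accuracies.values()):.3f}")
```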
The baselines are named as follows:

| Baseline as Named in Paper | Baseline suffix code for JSON file |
|---|---|
| Indic Train | `in` |
| English Train | `en` |
| English Eval | `back` |
| English + Indic Train | `2_step` |
| Train all | `n_step` |
| Crosslingual Transfer | `{language code the model is trained on, e.g. as, gu}_in` |
| En-indic | - |
| English + Indic Train | `2_step_hypo` |
| Train all | `n_step_hypo` |

For example, the English Train baseline produces `xnli_experiments_{model name}_en.json`.