Code repository for the research paper "T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification", published in the Transactions of the Association for Computational Linguistics (TACL).
Python version > 3.6.
Install required packages with pip:
pip install -r requirements.txt
# Install our adapted transformers package
cd t3l/transformers
pip install -e .
cd ../..
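Optionally, as a sanity check (not part of the original instructions), you can verify that Python resolves the adapted transformers package installed in editable mode rather than a previously installed release:
python -c "import transformers; print(transformers.__file__)"
# The printed path should point inside the t3l/transformers directory of this repository.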
The XNLI and MultiEURLEX datasets, their respective few-shot 10 and few-shot 100 data splits, and the TED talks translation corpus used in the paper to fine-tune the bg-en, el-en and sw-en MT-LM models can be found here.
Downloading the MLDoc data requires approval from TREC and following the script provided by the MLDoc paper authors to extract the benchmark dataset.
The following commands show how to fine-tune the different models described in the paper: first the machine translation (MT) model, then the English classification language model (LM), and finally the full T3L pipeline. First, fine-tune the MT model:
python -m scripts.train_translation --train_data $TRAIN \
--validation_data $DEV \
--test_data $TEST \
--src $SRC \
--tgt $TGT \
--save_dir $OUT \
--tokenizer $PRETRAINED_LM \
--model_lm_path $PRETRAINED_LM \
--batch_size $BATCH_SIZE \
--grad_accum $GRAD_ACCUM \
--gpus $NUM_GPUS \
--epochs $EPOCHS
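As an illustration, the variables for fine-tuning a Bulgarian-to-English MT model on the TED talks corpus mentioned above could be set along the following lines; all paths, the pretrained model identifier and the hyperparameter values are placeholders, not values taken from the repository or the paper:
# Illustrative settings only -- adjust to your own data locations and model choice
SRC=bg
TGT=en
TRAIN=data/ted_talks/bg-en/train.tsv
DEV=data/ted_talks/bg-en/dev.tsv
TEST=data/ted_talks/bg-en/test.tsv
OUT=checkpoints/mt_bg_en
PRETRAINED_LM=facebook/mbart-large-50   # placeholder multilingual seq2seq checkpoint
BATCH_SIZE=8
GRAD_ACCUM=4
NUM_GPUS=1
EPOCHS=3
Next, fine-tune the English classification LM on the English training data of the target task: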
python -m scripts.train_lm --task XNLI \
--train_data $TRAIN \
--validation_data $DEV \
--test_data $TEST \
--save_dir $OUT \
--batch_size $BATCH_SIZE \
--grad_accum $GRAD_ACCUM \
--lr $LEARNING_RATE \
--gpus $NUM_GPUS \
--tokenizer $PRETRAINED_EN \
--model_lm_path $PRETRAINED_EN \
--epochs $EPOCH
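For example, for the XNLI task the classification LM could be fine-tuned with settings along these lines (again, paths, the model identifier and the hyperparameters are illustrative placeholders):
# Illustrative settings only
TRAIN=data/xnli/train_en.tsv
DEV=data/xnli/dev_en.tsv
TEST=data/xnli/test_en.tsv
OUT=checkpoints/lm_xnli_en
PRETRAINED_EN=facebook/bart-large   # placeholder English pretrained LM
BATCH_SIZE=16
GRAD_ACCUM=2
LEARNING_RATE=3e-5
NUM_GPUS=1
EPOCH=3
Finally, fine-tune the full T3L pipeline, which couples the MT model with the English classification LM: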
python -m scripts.training_t3l --task XNLI \
--train_data $TRAIN \
--validation_data $DEV \
--test_data $TEST \
--src $SRC \
--tgt $TGT \
--save_dir $OUT \
--batch_size $BATCH_SIZE \
--grad_accum $GRAD_ACCUM \
--freeze_strategy fix_nmt_dec_lm_enc \
--gpus $NUM_GPUS \
--tokenizer $PRETRAINED_LM_EN \
--max_input_len $MAX_SEQ_LEN_INPUT \
--max_output_len $MAX_SEQ_LEN_OUTPUT \
--warmup $WARMUP \
--lr $LEARNING_RATE \
--model_lm_path $PRETRAINED_LM_EN \
--model_seq2seq_path $PRETRAINED_MT \
--epochs $EPOCHS
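As a sketch, the T3L pipeline for Bulgarian XNLI could reuse the checkpoints produced by the two previous steps; the paths, sequence lengths and hyperparameters below are placeholders, and the exact checkpoint formats expected by --model_seq2seq_path and --model_lm_path are defined by the repository scripts:
# Illustrative settings only
SRC=bg
TGT=en
TRAIN=data/xnli/few_shot_100/train_bg.tsv   # e.g. one of the few-shot splits
DEV=data/xnli/dev_bg.tsv
TEST=data/xnli/test_bg.tsv
OUT=checkpoints/t3l_xnli_bg
PRETRAINED_LM_EN=checkpoints/lm_xnli_en     # LM classifier from the previous step (placeholder path)
PRETRAINED_MT=checkpoints/mt_bg_en          # MT model from the first step (placeholder path)
MAX_SEQ_LEN_INPUT=256
MAX_SEQ_LEN_OUTPUT=256
WARMUP=500
LEARNING_RATE=3e-5
BATCH_SIZE=4
GRAD_ACCUM=8
NUM_GPUS=1
EPOCHS=3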
To perform inference (test) over the test sets with the trained models, simply add the --test flag and the --from_pretrained argument providing the path to the trained checkpoint.
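For example, evaluating a trained T3L model on the XNLI test set could look as follows, keeping the same arguments as the training command and adding the two test-time options (the checkpoint file name is a placeholder):
python -m scripts.training_t3l --task XNLI \
--train_data $TRAIN \
--validation_data $DEV \
--test_data $TEST \
--src $SRC \
--tgt $TGT \
--save_dir $OUT \
--batch_size $BATCH_SIZE \
--grad_accum $GRAD_ACCUM \
--freeze_strategy fix_nmt_dec_lm_enc \
--gpus $NUM_GPUS \
--tokenizer $PRETRAINED_LM_EN \
--max_input_len $MAX_SEQ_LEN_INPUT \
--max_output_len $MAX_SEQ_LEN_OUTPUT \
--warmup $WARMUP \
--lr $LEARNING_RATE \
--model_lm_path $PRETRAINED_LM_EN \
--model_seq2seq_path $PRETRAINED_MT \
--epochs $EPOCHS \
--test \
--from_pretrained $OUT/best.ckpt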
Please cite our paper in your work:
@article{10.1162/tacl_a_00593,
author = {Jauregi Unanue, Inigo and Haffari, Gholamreza and Piccardi, Massimo},
title = "{T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification}",
journal = {Transactions of the Association for Computational Linguistics},
volume = {11},
pages = {1147-1161},
year = {2023},
month = {09},
abstract = "{Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero/ few-shots cross-lingual transfer). Nowadays, cross-lingual text classifiers are typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest. However, the performance of these models varies significantly across languages and classification tasks, suggesting that the superposition of the language modelling and classification tasks is not always effective. For this reason, in this paper we propose revisiting the classic “translate-and-test” pipeline to neatly separate the translation and classification stages. The proposed approach couples 1) a neural machine translator translating from the targeted language to a high-resource language, with 2) a text classifier trained in the high-resource language, but the neural machine translator generates “soft” translations to permit end-to-end backpropagation during fine-tuning of the pipeline. Extensive experiments have been carried out over three cross-lingual text classification datasets (XNLI, MLDoc, and MultiEURLEX), with the results showing that the proposed approach has significantly improved performance over a competitive baseline.}",
issn = {2307-387X},
doi = {10.1162/tacl_a_00593},
url = {https://doi.org/10.1162/tacl\_a\_00593},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00593/2159097/tacl\_a\_00593.pdf},
}