This repository contains the code and data for the paper "Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?"
To set up the environment, we recommend using conda, e.g.:
conda create -n ml_llm -c conda-forge python=3.10 cudatoolkit=11.8 -y
conda activate ml_llm
pip install vllm==0.2.1
pip install -r requirements.txt
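As an optional sanity check (not part of the original instructions), you can confirm that the CUDA-enabled PyTorch build and vLLM imported correctly, e.g.:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # should print True on a GPU machine
python -c "import vllm; print(vllm.__version__)"  # should print 0.2.1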
Download the model used for language detection to resources/lid/:
mkdir -p resources/lid
wget https://data.statmt.org/lid/lid201-model.bin.gz -P resources/lid/
gzip -d resources/lid/lid201-model.bin.gz
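The downloaded lid201-model.bin is a fastText classifier. As an illustration only (the exact usage in the repository's scripts may differ, and this assumes the fasttext Python package is installed), it can be loaded and queried like this:

import fasttext

# load the language identification model downloaded above
lid_model = fasttext.load_model("resources/lid/lid201-model.bin")

# predict the most likely language label for a sentence
# "Hvað heitir þú?" is Icelandic for "What is your name?"
labels, probs = lid_model.predict("Hvað heitir þú?", k=1)
print(labels[0], probs[0])  # e.g. '__label__isl_Latn' with a high probability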
For evaluations using EleutherAI's LM Evaluation Harness, run:
git clone git@github.com:EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git reset --hard 3ccea2b2
pip install -e ".[multilingual]"
If running experiments with OpenAI's API-based models, create a file containing your API key, e.g.:
echo "OPENAI_API_KEY = 'YOUR_OPENAI_API_KEY'" > api_secrets.py
The formatted datasets and finetuned models used in our experiments are available on the Hugging Face Hub: https://huggingface.co/collections/tannonk/multilingual-instruction-tuning-65855e8d92eba5ad69df4b2a
The data used for our experiments is available in the data directory and can be prepared from the original sources using the provided preparation scripts.
To train a model on a given dataset, use the script sft_training.py. For example:
CUDA_VISIBLE_DEVICES=2,3 nohup python sft_training.py \
--model_name_or_path "meta-llama/Llama-2-7b-hf" \
--train_dataset "data/guanaco/guanaco_train_ml2.json" \
--eval_dataset "data/guanaco/guanaco_test.json" \
--output_dir "resources/models/llama_2_7b_hf_ml2" \
--num_train_epochs 10 \
--per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 4 \
--log_with "wandb" >| resources/models/logs/llama_2_7b_hf_ml2.log &
Once training is complete, we merge the learned adapters with the base model for easy loading with vLLM:
python merge_peft_adapter.py \
--adapter_model_name_or_path "resources/models/llama_2_7b_hf_ml2" \
--output_dir "resources/models/llama_2_7b_hf_ml2_merged"
To run inference for the different tasks, use the appropriate run_*_inference.sh script in the scripts directory, specifying the GPU device ID, model directories, and evaluation datasets. For example:
bash scripts/run_alpaca_inference.sh \
-d 0 \
-m resources/models/llama_2_7b_hf_ml2_merged resources/models/llama_2_7b_hf_ml3_merged \
-t data/alpaca_eval/alpaca_eval_instructions_is.json data/alpaca_eval/alpaca_eval_instructions_el.json data/alpaca_eval/alpaca_eval_instructions_hi.json
bash scripts/run_xcsqa_inference.sh \
-d 0 \
-m resources/models/llama_2_7b_hf_ml2_merged resources/models/llama_2_7b_hf_ml3_merged \
-t data/xcsqa/xcsqa_dev_zh_zh.json data/xcsqa/xcsqa_dev_fr_fr.json
bash scripts/run_xnli_inference.sh \
-d 0 \
-m resources/models/llama_2_7b_hf_ml2_merged resources/models/llama_2_7b_hf_ml3_merged \
-t data/xquad/xquad_dev_en_hi.json data/xquad/xquad_dev_hi_hi.json
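Each of these scripts serves the merged model with vLLM. As a rough illustration of the underlying generation step (prompt formatting and decoding parameters in the actual scripts will differ):

from vllm import LLM, SamplingParams

# load one of the merged models trained above
llm = LLM(model="resources/models/llama_2_7b_hf_ml2_merged")

# sampling settings chosen for illustration only
params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=512)

# "Hver er höfuðborg Íslands?" is Icelandic for "What is the capital of Iceland?"
outputs = llm.generate(["Hver er höfuðborg Íslands?"], params)
print(outputs[0].outputs[0].text)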
To evaluate a merged model on the LM Evaluation Harness tasks, run, e.g.:
nohup bash scripts/run_lm_eval_harness.sh 0 resources/models/llama_2_7b_hf_ml2_merged >| logs/llama_2_7b_hf_ml2_merged.log &
The script run_llm_judge.sh can be used to evaluate chat responses for multiple models and target languages, e.g.:
bash scripts/run_llm_judge.sh \
-m data/alpaca_eval_outputs/llama_2_7b_hf_ml2_merged data/alpaca_eval_outputs/llama_2_7b_hf_ml3_merged \
-l is el hi
Plots used in the paper can be generated using the provided notebook.
@misc{kew2023turning,
      title={Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?},
      author={Tannon Kew and Florian Schottmann and Rico Sennrich},
      year={2023},
      eprint={2312.12683},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}