Clone the github repository.
git clone https://github.com/ffaisal93/DialectBench.io.git
cd DialectBench
- Download all data available
[except mt and the ones loadable through huggingface]
bash download_data.sh --task all
- Download data for Turkish dialectal machine translation
bash download_data.sh --task machine_translation_turkish
Dependency parsing:
Install Adapter Package
bash install.sh --task install_adapter
Extractive Question Answering [SDQA]:
Install Transformers 3.4.0
bash install.sh --task install_transformers_qa
Other Structured Prediction, QA and Classification tasks:
Transformers 4.21.1
bash install.sh --task install_transformers
- Finetune all available language-specific models on both pretrained mBERT and XLMR at once
./all_commands.sh --action train_udp --execute bash
- Finetune one single available language-specific model
bash install.sh --task train_udp --lang UD_English-EWT --MODEL_NAME mbert
- Prediction on all finetuned model (for both pretrained mBERT and XLMR) and if no training data available for a specific language variety, do zeroshot from English variety
"UD_English-EWT"
./all_commands.sh --action predict_udp --execute bash
- Do zero-shot prediction from a specific language variety (e.g.
UD_English-EWT
) and on all available variety defined in--lang_config metadata/udp_metadata.json
bash install.sh --task predict_udp_zeroshot_all --lang UD_English-EWT --MODEL_NAME mbert
- Do test data prediction on a single finetuned language variety (e.g.
UD_English-EWT
)bash install.sh --task predict_udp_single --lang UD_English-EWT --MODEL_NAME mbert
- Finetune all available language-specific models on both pretrained mBERT and XLMR at once
./all_commands.sh --action train_pos --execute bash
- Prediction on all finetuned model (for both pretrained mBERT and XLMR) and if no training data available for a specific language variety, do zeroshot from English variety
"UD_English-EWT"
./all_commands.sh --action predict_pos --execute bash
- Performing in-variety Finetuning on all available language varieties on both pretrained mBERT and XLMR at one go.
./all_commands.sh --action train_pos --execute bash
- or, If you want to performing in-variety finetuning for a single language only, try the following:
bash install.sh --task train_ner --lang bokmaal --MODEL_NAME bert --dataset wikiann
- We have two datasets supported in DialectBench at this point.
wikiann
andnorwegian_ner
.-
wikiann
: language varieties("ar" "az" "ku" "tr" "hsb" "nl" "fr" "zh" "en" "mhr" "it" "de" "pa" "es" "hr" "lv" "hi" "ro" "el" "bn")
. Use--dataset wikiann
to finetune varieties from this dataset. -
norwegian_ner
: language varieties ("bokmaal" "nynorsk" "samnorsk"). Use--dataset scripts/ner/norwegian_ner.py
to finetune varieties from this dataset.
-
- Prediction using all in-variety finetuned models (for both pretrained mBERT and XLMR) as well as performing zeroshot prediction using English variety
en
on the varieties available in--lang_config metadata/metadata/ner_metadata.json
at one go../all_commands.sh --action predict_ner --execute bash
- Performing In-cluster finetuning (on both pretrained mbert and xlm-r) on selected varieties from different language cluster.
./all_commands.sh --action train_topic_classification_lm --execute bash
-
Add or remove specific variety for finetuning from SIB-200 dataset here in
command-bash.sh
file.if [[ "$task" = "train_topic_classification_lm" || "$task" = "predict_topic_classification_lm" ]]; then export ALL_LANGS=("eng_Latn" "ita_Latn" "azj_Latn" "ckb_Arab" "nob_Latn" "nld_Latn" "lvs_Latn" "arb_Arab" "lij_Latn" "zho_Hans" "spa_Latn" "nso_Latn")
- Performing inference on all available varieties across different language clusters (as defined in
--lang_config metadata/topic_metadata.json
) and on top of different pretrained models (mbert, xlmr)
./all_commands.sh --action predict_topic_classification_lm --execute bash
- Performing zero-shot finetuning from English (on top of both pretrained mbert and xlm-r) on selected varieties from different language cluster.
./all_commands.sh --action train_nli --execute bash
-
Add or remove specific variety for finetuning from translate-test
dialect_nli
dataset here incommand-bash.sh
file.if [[ "$task" = "train_nli" || "$task" = "predict_nli" ]]; then # export ALL_LANGS=("eng_Latn" "ita_Latn" "azj_Latn" "ckb_Arab" "nob_Latn" "nld_Latn" "lvs_Latn" "arb_Arab" "lij_Latn" "zho_Hans" "spa_Latn" "nso_Latn" "ben_Beng") export ALL_LANGS=("eng_Latn") for lang in "${ALL_LANGS[@]}"; do echo ${base_model} echo ${lang} echo ${dataset} bash install.sh --task ${task} --lang ${lang} --MODEL_NAME ${base_model} done fi
-
dialect_nli
dataset loading script:--dataset_script scripts/nli/dialect_nli.py
- Performing inference on all available varieties across different language clusters (as defined in
--lang_config metadata/nli_metadata.json
) and on top of different pretrained models (mbert, xlmr).
./all_commands.sh --action predict_nli --execute bash
- At this point, DialectBench only supports arabic dialectal sentiment analysis. To finetune variety-specific models:
./all_commands.sh --action train_sa --execute bash
- To evaluate each variety-specific model at one go:
./all_commands.sh --action predict_sa --execute bash
-
Add or remove specific variety for finetuning in
command-bash.sh
file.if [[ "$task" = "train_sa" || "$task" = "predict_sa" ]]; then export ALL_LANGS=("aeb_Arab" "aeb_Latn" "arb_arab" "ar-lb" "arq_arab" "ary_arab" "arz_arab" "jor_arab" "sau_arab") for lang in "${ALL_LANGS[@]}"; do echo ${base_model} echo ${lang} echo ${dataset} bash install.sh --task ${task} --lang ${lang} --lang2 arabic --MODEL_NAME ${base_model} done fi
- Finetune Arabic, English, Mandarin, Portuguese, Spanish and Swiss-Dialect identification models (mbert and xlmr based)
./all_commands.sh --action train_did --execute bash
- Finetune a dialect identification model of a single language
export lang="arabic" #"arabic" english" "greek" "mandarin_simplified" "mandarin_traditional" "portuguese" "spanish" "swiss-dialects"
export base_model="mbert" #"mbert" "xlmr"
bash install.sh --task train_did --lang ${lang} --dataset ${dataset} --MODEL_NAME ${base_model}
./all_commands.sh --action predict_did_lm --execute bash
./all_commands.sh --action train_reading_comprehension --execute bash
./all_commands.sh --action predict_reading_comprehension --execute bash
- Finetune on all language at once as well as on singlae language and it's varieties.
./all_commands.sh --action train_sdqa --execute bash
- Add or remove specific language cluster in this
command-bash.sh
block.
f [[ "$task" = "train_sdqa" || "$task" = "predict_sdqa" ]]; then
export ALL_MODELS=("all" "arabic" "bengali" "english" "finnish" "indonesian" "korean" "russian" "swahili" "telugu")
for MODEL_NAME in "${ALL_MODELS[@]}"; do
echo ${base_model}
echo ${MODEL_NAME}
bash install.sh --task ${task} --lang ${MODEL_NAME} --MODEL_NAME ${base_model} --dataset dev
bash install.sh --task ${task} --lang ${MODEL_NAME} --MODEL_NAME ${base_model} --dataset test
done
fi
./all_commands.sh --action predict_sdqa --execute bash