ss-cs-depparser

A semi-supervised deep dependency parser for code-switched language pairs


Semi-Supervised CS Dependency Parser

This repository contains the source code and trained models of the semi-supervised deep dependency parser described in our paper "Improving Code-Switching Dependency Parsing with Semi-Supervised Auxiliary Tasks". The parser employs the semi-supervised learning approach DCST and utilizes auxiliary tasks for dependency parsing of code-switched (CS) language pairs. There are two versions of the parsing model: one is LSTM-based and the other is XLM-R-based. The following sections explain how to run these models. The trained models can be found here.

1. How-To-Run the LSTM-based Parser

Requirements

Run the following:

- pip install -r requirements.txt

Datasets

Word Embeddings

The LSTM-based models need pretrained word embeddings.

  • Download FastText embeddings from https://fasttext.cc/docs/en/crawl-vectors.html

    • In the paper, we used Dutch embeddings for the Frisian-Dutch language pair, Hindi embeddings for Hindi-English, Russian embeddings for Komi-Zyrian, and Turkish embeddings for Turkish-German.
  • Unzip them and place them under the data/multilingual_word_embeddings folder (a minimal download-and-unzip sketch is shown after this list)
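
If it helps, here is a minimal Python sketch of this download-and-unzip step. It assumes the standard FastText crawl-vector URL pattern (cc.<lang>.300.vec.gz); adjust LANG for your language pair and verify the URL against the FastText download page.

    # Minimal sketch (assumption: the usual FastText crawl-vector URL pattern).
    import gzip
    import shutil
    import urllib.request
    from pathlib import Path

    LANG = "tr"  # e.g., Turkish embeddings for the Turkish-German (qtd_sagt) setup
    url = f"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.{LANG}.300.vec.gz"
    out_dir = Path("data/multilingual_word_embeddings")
    out_dir.mkdir(parents=True, exist_ok=True)

    gz_path = out_dir / f"cc.{LANG}.300.vec.gz"
    vec_path = out_dir / f"cc.{LANG}.300.vec"

    urllib.request.urlretrieve(url, gz_path)  # download the compressed vectors
    with gzip.open(gz_path, "rb") as src, open(vec_path, "wb") as dst:
        shutil.copyfileobj(src, dst)          # decompress to cc.<lang>.300.vec
    gz_path.unlink()                          # remove the archive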


Let's say we want to train the LSTM-based model with auxiliary task enhancements for the Turkish-German SAGT Treebank (qtd_sagt). As the unlabeled data, we use "TuGeBiC" (qtd_trde90).

  • Download the corpus, concatenate all CoNLL-U files, and split them into train and dev files (a minimal sketch of this step is given after this list). Name the training file "qtd_trde90-ud-train.conllu" and the dev file "qtd_trde90-ud-dev.conllu", and place both under the folder data/datasets/UD_QTD-TRDE90/
  • Run the script:

    python utils/io_/convert_ud_to_onto_format.py --ud_data_path data/datasets
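
Here is a minimal sketch of the join-and-split step from the first bullet above. The input folder name, the 90/10 ratio, and the shuffle seed are assumptions; adapt them as needed.

    # Minimal sketch: concatenate all TuGeBiC CoNLL-U files and split the sentences
    # into train/dev sets (the 90/10 split and the input folder are assumptions).
    import glob
    import random
    from pathlib import Path

    def read_sentences(path):
        """Return the sentence blocks (comment + token lines) of a CoNLL-U file."""
        blocks = Path(path).read_text(encoding="utf-8").strip().split("\n\n")
        return [b for b in blocks if b.strip()]

    sentences = []
    for f in sorted(glob.glob("tugebic/*.conllu")):  # wherever you unpacked TuGeBiC
        sentences.extend(read_sentences(f))

    random.seed(42)
    random.shuffle(sentences)
    cut = int(0.9 * len(sentences))

    out_dir = Path("data/datasets/UD_QTD-TRDE90")
    out_dir.mkdir(parents=True, exist_ok=True)
    splits = {"qtd_trde90-ud-train.conllu": sentences[:cut],
              "qtd_trde90-ud-dev.conllu": sentences[cut:]}
    for name, sents in splits.items():
        (out_dir / name).write_text("\n\n".join(sents) + "\n\n", encoding="utf-8")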

1.1. Train the baseline parser:
python examples/GraphParser.py --dataset ud --domain qtd_sagt --rnn_mode LSTM --num_epochs 150 --batch_size 16 --hidden_size 512 --arc_space 512 --arc_tag_space 128 --num_layers 3 --num_filters 100 --use_char --use_pos --word_dim 300 --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --arc_decode mst --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --model_path saved_models/ud_parser_qtd_sagt_full_train
1.2. Parse the unlabeled data:
- python examples/GraphParser.py --dataset ud --domain qtd_trde90 --rnn_mode LSTM --num_epochs 150 --batch_size 16 --hidden_size 512 --arc_space 512 --arc_tag_space 128 --num_layers 3 --num_filters 100 --use_char --use_pos --word_dim 300 --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --arc_decode mst --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --model_path saved_models/ud_parser_qtd_sagt_full_train --eval_mode --strict --load_path saved_models/ud_parser_qtd_sagt_full_train/domain_qtd_sagt.pt
1.3. Train sequence labelers:
Number of Children Task (NOC):
- python examples/SequenceTagger_for_DA.py --dataset ud --src_domain qtd_sagt --tgt_domain qtd_trde90 --task number_of_children --rnn_mode LSTM --num_epochs 100 --batch_size 16 --hidden_size 512 --tag_space 128 --num_layers 3 --num_filters 100 --use_char  --use_pos --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --parser_path saved_models/ud_parser_qtd_sagt_full_train/ --use_unlabeled_data --model_path saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_number_of_children_unlabeled/
Distance to the Root Task (DTR):
- python examples/SequenceTagger_for_DA.py --dataset ud --src_domain qtd_sagt --tgt_domain qtd_trde90 --task distance_from_the_root --rnn_mode LSTM --num_epochs 100 --batch_size 16 --hidden_size 512 --tag_space 128 --num_layers 3 --num_filters 100 --use_char  --use_pos --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --parser_path saved_models/ud_parser_qtd_sagt_full_train/ --use_unlabeled_data --model_path saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_distance_from_the_root_unlabeled/
Relative POS Encoding Task (RPE):
- python examples/SequenceTagger_for_DA.py --dataset ud --src_domain qtd_sagt --tgt_domain qtd_trde90 --task relative_pos_based --rnn_mode LSTM --num_epochs 100 --batch_size 16 --hidden_size 512 --tag_space 128 --num_layers 3 --num_filters 100 --use_char  --use_pos --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --parser_path saved_models/ud_parser_qtd_sagt_full_train/ --use_unlabeled_data --model_path saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_relative_pos_based_unlabeled/
Language ID of Head Task (LIH):
- python examples/SequenceTagger_for_DA.py --dataset ud --src_domain qtd_sagt --tgt_domain qtd_trde90 --task head_lang_ids --rnn_mode LSTM --num_epochs 100 --batch_size 16 --hidden_size 512 --tag_space 128 --num_layers 3 --num_filters 100 --use_char  --use_pos --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --parser_path saved_models/ud_parser_qtd_sagt_full_train/ --use_unlabeled_data --model_path saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_head_lang_ids_unlabeled/
Simplified Morphology of Head Task (SMH):
- python examples/SequenceTagger_for_DA.py --dataset ud --src_domain qtd_sagt --tgt_domain qtd_trde90 --task head_simplified_morp_feats --rnn_mode LSTM --num_epochs 100 --batch_size 16 --hidden_size 512 --tag_space 128 --num_layers 3 --num_filters 100 --use_char  --use_pos --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --parser_path saved_models/ud_parser_qtd_sagt_full_train/ --use_unlabeled_data --model_path saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_head_simplified_morp_feats_unlabeled/
Punctuation Count Task (PC):
- python examples/SequenceTagger_for_DA.py --dataset ud --src_domain qtd_sagt --tgt_domain qtd_trde90 --task count_punct --rnn_mode LSTM --num_epochs 100 --batch_size 16 --hidden_size 512 --tag_space 128 --num_layers 3 --num_filters 100 --use_char  --use_pos --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --parser_path saved_models/ud_parser_qtd_sagt_full_train/ --use_unlabeled_data --model_path saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_count_punct_unlabeled/
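
For reference, the labels for all of these auxiliary tasks are derived automatically from the auto-parsed trees. As an illustration only, the sketch below shows how a number-of-children (NOC) label sequence could be computed from a single CoNLL-U sentence; the exact label scheme used by SequenceTagger_for_DA.py may differ.

    # Illustration only: one NOC label per token of a CoNLL-U sentence.
    # The actual label scheme in SequenceTagger_for_DA.py may differ.
    from collections import Counter

    def noc_labels(sentence_block):
        """sentence_block: one CoNLL-U sentence as a string; returns one label per token."""
        rows = [line.split("\t") for line in sentence_block.splitlines()
                if line and not line.startswith("#")]
        rows = [r for r in rows if r[0].isdigit()]  # skip multiword ranges and empty nodes
        head_counts = Counter(int(r[6]) for r in rows if r[6].isdigit())  # HEAD column
        return [str(head_counts.get(int(r[0]), 0)) for r in rows]

    example = ("1\tBugün\t_\tADV\t_\t_\t2\tadvmod\t_\t_\n"
               "2\tgeldi\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
               "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_")
    print(noc_labels(example))  # -> ['0', '2', '0']
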
1.4. Train the final model:

Now that we have trained all of the auxiliary task models, we can train the final model. Let's say we want to train the best-performing model in the paper for the Turkish-German SAGT Treebank: +RPE,+LIH,+SMH, i.e., the ensemble of the RPE, LIH, and SMH tasks.

- python examples/GraphParser_for_DA.py --dataset ud --src_domain qtd_sagt --tgt_domain qtd_trde90 --rnn_mode LSTM --num_epochs 150 --batch_size 16 --hidden_size 512 --arc_space 512 --arc_tag_space 128 --num_layers 3 --num_filters 100 --use_char --use_pos  --word_dim 300 --char_dim 100 --pos_dim 100 --initializer xavier --opt adam --learning_rate 0.002 --decay_rate 0.5 --schedule 6 --clip 5.0 --gamma 0.0 --epsilon 1e-6 --p_rnn 0.33 0.33 --p_in 0.33 --p_out 0.33 --arc_decode mst --unk_replace 0.5 --punct_set '.' '``'  ':' ','  --word_embedding fasttext --word_path "data/multilingual_word_embeddings/cc.tr.300.vec" --char_embedding random --gating --num_gates 4 --load_sequence_taggers_paths saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_relative_pos_based_unlabeled/src_domain_qtd_sagt_tgt_domain_qtd_trde90.pt saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_head_lang_ids_unlabeled/src_domain_qtd_sagt_tgt_domain_qtd_trde90.pt saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_head_simplified_morp_feats_unlabeled/src_domain_qtd_sagt_tgt_domain_qtd_trde90.pt --model_path saved_models/ud_parser_qtd_sagt_qtd_trde90_ensemble_gating_RPE_LIH_SMH/

If you want to include only one sequence labeler (e.g., only the +NOC model), set --num_gates to 2 and provide only that sequence labeler's trained model path.
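
For example, a +NOC-only run would keep all flags of the command above and change only the gating arguments; the --model_path value below is just a suggested name, and "[same flags as above]" stands for the unchanged flags:

- python examples/GraphParser_for_DA.py [same flags as above] --gating --num_gates 2 --load_sequence_taggers_paths saved_models/ud_sequence_tagger_qtd_sagt_qtd_trde90_number_of_children_unlabeled/src_domain_qtd_sagt_tgt_domain_qtd_trde90.pt --model_path saved_models/ud_parser_qtd_sagt_qtd_trde90_gating_NOC/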



2. How-To-Run the XLM-R-based Parser

Requirements

Create a conda environment using the environment.yml file:

  • conda env create -f XLM-R-based/auxiliary-task-train/steps_parser/environment.yml

Activate the environment:

  • conda activate ss_cs_depparse

Pretrained Language Model

Download XLM-R base model from Hugging Face and locate it under XLM-R-based/dcst-parser-train/pretrained_model/.
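
One way to fetch the model locally is sketched below; it assumes the transformers library from the conda environment is available, and the target subdirectory name is only a suggestion, so adjust it to whatever path your parser config expects.

    # Sketch: download xlm-roberta-base from Hugging Face and save it locally.
    # The "xlm-roberta-base" subfolder name is an assumption; adjust as needed.
    from transformers import AutoModel, AutoTokenizer

    target_dir = "XLM-R-based/dcst-parser-train/pretrained_model/xlm-roberta-base"
    AutoTokenizer.from_pretrained("xlm-roberta-base").save_pretrained(target_dir)
    AutoModel.from_pretrained("xlm-roberta-base").save_pretrained(target_dir)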



2.1. Example Run using the Trained Models on the Turkish-German code-switching (QTD_SAGT) Treebank:

2.1.1. Datasets

  • For labeled data, we use the QTD_SAGT dataset from Universal Dependencies:
    • Download QTD_SAGT treebank and locate it under LSTM-based/DCST/data/datasets/
  • For unlabeled data, we use "TuGeBiC".
    • Download the corpus, concatenate all CoNLL-U files, and split them into train and dev files, as in Section 1. Name the training file "qtd_trde90-ud-train_autoparsed.conllu" and the dev file "qtd_trde90-ud-dev_autoparsed.conllu", and place both under XLM-R-based/auxiliary-task-train/preprocessed_unlabeled_data/

2.1.2. Preprocess Unlabeled Data

Navigate to XLM-R-based/auxiliary-task-train/preprocessed_unlabeled_data/

Run the corresponding Python script for the auxiliary task you want to use. E.g., for the LIH task:

- python dcst_langid_of_head.py qtd_trde90-ud-train_autoparsed.conllu qtd_trde90-ud-train_autoparsed_lih.conllu
- python dcst_langid_of_head.py qtd_trde90-ud-dev_autoparsed.conllu qtd_trde90-ud-dev_autoparsed_lih.conllu

2.1.3. Trained Models

Download the trained models from the Trained_Models_XLM-R folder. Locate parser_models under XLM-R-based/dcst-parser-train/trained_models/ and auxiliary_task_models under XLM-R-based/auxiliary-task-train/trained_models/


2.1.4. Use the Trained Model to Parse QTD_SAGT:

Let's say we want to use the +LIH model for Tr-De CS pair (QTD_SAGT).

- cd XLM-R-based/dcst-parser-train/steps_parser/

- python src/train.py ../deps_lih_qtd.json


2.2. Another Example Run using the Trained Models on the monolingual Turkish IMST Treebank:

Here, we show how to run the XLM-R-based model trained with the SMH (simplified morphology of head) task on the TR_IMST Treebank.

2.2.1. Datasets:

- Download TR_IMST Treebank from Universal Dependencies and locate it under **LSTM-based/DCST/data/datasets/**
- For unlabeled data, we use TR_BOUN Treebank. Download TR_BOUN Treebank from Universal Dependencies and locate it under **LSTM-based/DCST/data/datasets/**

(NOTE: You can skip the following two steps (2.2.2. and 2.2.3.) if you use the TR_BOUN Treebank as the unlabeled data: to ease the process, we have already put the pseudo-labeled data under XLM-R-based/auxiliary-task-train/preprocessed_unlabeled_data/ as tr_boun-ud-train-parsedbyimst-smh.conllu and tr_boun-ud-dev-parsedbyimst-smh.conllu.)

2.2.2. Pseudo-label the Unlabeled Data with the Base Parser:

Since the TR_BOUN Treebank comes with gold annotations, we need to re-label it automatically for our purposes. Train the base parser, parse the treebank with the trained model, and save the resulting train and dev files as "tr_boun-ud-train_autoparsed.conllu" and "tr_boun-ud-dev_autoparsed.conllu". Place these files under XLM-R-based/auxiliary-task-train/preprocessed_unlabeled_data/.

2.2.3. Preprocess Unlabeled Data

Navigate to XLM-R-based/auxiliary-task-train/preprocessed_unlabeled_data/

Run the corresponding Python script for the auxiliary task you want to use. E.g., for the SMH task:

- python dcst_simplified_morp_of_head.py tr_boun-ud-train_autoparsed.conllu tr_boun-ud-train-parsedbyimst-smh.conllu
- python dcst_simplified_morp_of_head.py tr_boun-ud-dev_autoparsed.conllu tr_boun-ud-dev-parsedbyimst-smh.conllu

2.2.4. Trained Models

Download the trained models from the Trained_Models_XLM-R folder. Locate parser_models under XLM-R-based/dcst-parser-train/trained_models/ and auxiliary_task_models under XLM-R-based/auxiliary-task-train/trained_models/


2.2.5. Use the Trained Model to Parse TR_IMST:

We will use the +SMH model to parse the TR_IMST dataset.

- cd XLM-R-based/dcst-parser-train/steps_parser/

- python src/train.py ../deps_smh_imst.json