Official code for the paper "Deep Contextualized Self-training for Low Resource Dependency Parsing".
If you use this code please cite our paper.
- Python 3.7
- PyTorch 1.1.0
- CUDA 10.0
To install the dependencies, simply run:
pip install -r requirements.txt
The data is preprocessed in note format. The data folder can be obtained from here.
Embeddings can be found here.
Possible word embedding options: ['random', 'fasttext']
The multilingual word embeddings (.vec extension) should be placed under the data/multilingual_word_embeddings folder.
To run the low-resource in-domain experiments, there are three steps we need to follow:
- Running the base Biaffine parser
- Running the sequence tagger(s)
- Running the combined DCST parser
If you want to run the complete model, simply run the bash script run_dcsh.sh. Otherwise, refer to the corresponding sections in run_dcsh.sh to run the individual segments.
- Without POS tags: Don't use the --use_pos flag at any stage, namely the base model, the auxiliary tasks, and the final ensembled model.
- With coarse-level tags: Use the input files in the data folder (from here) with the --use_pos flag.
- With POS-level tags: Swap the 2nd and 3rd columns of all the files in the data folder.
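Exchanging the 2nd and 3rd columns for the POS-level setting could be done along the following lines. This is a minimal sketch, not part of the repository: it assumes the data files are tab-separated with one token per line, and the helper names `swap_pos_columns` and `swap_pos_columns_in_folder` are hypothetical. It rewrites files in place, so run it on a copy of the data folder.

```python
from pathlib import Path


def swap_pos_columns(line):
    """Swap the 2nd and 3rd tab-separated fields of one token line."""
    stripped = line.rstrip("\n")
    # Assumption: blank lines separate sentences and '#' starts a comment line.
    if not stripped.strip() or stripped.startswith("#"):
        return line
    fields = stripped.split("\t")
    if len(fields) < 3:
        return line  # nothing to swap on short lines
    fields[1], fields[2] = fields[2], fields[1]
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline


def swap_pos_columns_in_folder(folder):
    """Rewrite every regular file directly inside `folder` (subfolders are skipped)."""
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            lines = handle.readlines()
        with path.open("w", encoding="utf-8") as handle:
            handle.writelines(swap_pos_columns(line) for line in lines)
```

Calling `swap_pos_columns_in_folder("data_copy")` on a copy of the folder keeps the original files available for the coarse-level setting. If embedding files sit directly in that folder, move them out first, since the sketch does not filter by extension.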
Use the average of the FastText, case-layer and nos-layer hidden representations as the embedding.
Set word_path="./data/cc.sanskrit.300.case.nos.vec" (or cc.sanskrit.300.FT.case.nos.vec, or cc.sanskrit.300.case.vec). These files can be found here.
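To make "average of representations" concrete, the sketch below averages several word-vector tables element-wise. It is only an illustration under assumptions: the provided .vec files presumably already contain the combined vectors, the text layout (`word v1 v2 ...` with an optional `<count> <dim>` first line) is the usual fastText text format, and the helper names `parse_vec` and `average_embeddings` are hypothetical. Words missing from any table are dropped.

```python
def parse_vec(lines):
    """Parse text lines of the form 'word v1 v2 ...' into {word: [floats]}."""
    table = {}
    for index, line in enumerate(lines):
        parts = line.split()
        # Skip blank lines and an optional '<vocab_size> <dim>' header line.
        if not parts or (index == 0 and len(parts) == 2):
            continue
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


def average_embeddings(*tables):
    """Element-wise mean over the words shared by all tables."""
    shared = set(tables[0]).intersection(*tables[1:])
    averaged = {}
    for word in shared:
        rows = [table[word] for table in tables]
        # zip(*rows) walks the vectors dimension by dimension.
        averaged[word] = [sum(values) / len(rows) for values in zip(*rows)]
    return averaged
```

A table can be loaded with `with open(path, encoding="utf-8") as f: table = parse_vec(f)`. All tables must share the same dimensionality for the mean to be meaningful.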
Note that to run the BiAFF classifier on a training set of 500 samples, set --set_num_training_samples 500. To train on the complete training data, remove this flag.
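Conceptually, limiting training to 500 samples means training on only 500 sentences of the training file. The sketch below shows one way to draw such a subset; how the flag itself selects sentences is not documented here, so the random sampling, the fixed seed, the blank-line sentence separator, and the helper names `split_sentences` and `sample_sentences` are all assumptions.

```python
import random


def split_sentences(text):
    """Split a treebank whose sentences are separated by blank lines."""
    blocks, current = [], []
    for line in text.split("\n"):
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def sample_sentences(blocks, num_samples, seed=0):
    """Return up to `num_samples` sentences, reproducibly, kept in file order."""
    if num_samples >= len(blocks):
        return list(blocks)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(blocks)), num_samples))
    return [blocks[i] for i in keep]
```

With `num_samples=500` this mirrors the low-resource setting; passing a value at least as large as the corpus returns every sentence, which corresponds to removing the flag.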
Refer to the corresponding section in run_dcsh.sh.
Once the base parser is trained, we can run the Sequence Tagger on any of the three proposed sequence tagging tasks in order to learn the syntactically contextualized word embeddings from the unlabeled data set.
- For the auxiliary tasks, set tasks to: 'number_of_children' 'relative_pos_based' 'distance_from_the_root'
- For the multitask setting, set tasks to: 'Multitask_case_predict' 'Multitask_POS_predict' 'Multitask_label_predict'
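As an illustration of what two of the auxiliary tasks ask the tagger to predict, the sketch below derives per-token labels from a dependency tree. It is a reading of the task names, not the repository's implementation: it assumes `heads[i]` holds the 1-based head index of token `i + 1` with 0 marking the root, and that the root token has distance 0. The 'relative_pos_based' task is omitted.

```python
def number_of_children(heads):
    """Count, for each token, how many tokens name it as their head."""
    counts = [0] * len(heads)
    for head in heads:
        if head > 0:  # 0 is the artificial root, which is not a token
            counts[head - 1] += 1
    return counts


def distance_from_the_root(heads):
    """Number of head links to follow from each token up to the root token."""
    distances = []
    for index in range(len(heads)):
        steps, node = 0, index + 1
        while heads[node - 1] != 0:
            node = heads[node - 1]
            steps += 1
            if steps > len(heads):
                raise ValueError("heads contain a cycle")
        distances.append(steps)
    return distances
```

For a four-token sentence with `heads = [2, 0, 2, 3]`, the second token is the root with two children, and the fourth token sits two links below it.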
Refer to the corresponding section in run_dcsh.sh.
As a final step we can now run the DCST (ensemble) parser:
Refer to the corresponding section in run_dcsh.sh.