
disco2labels

Discontinuous Constituent Parsing as Sequence Labeling - EMNLP 2020 repository

Requirements

  • Ubuntu 18.04
  • discodop
  • Python 3.6+
  • nltk 3.4.5
  • pytorch 1.2.0
  • transformers 2.5.1
  • scikit-learn 0.21.3

Installation

  • Create a virtual environment: virtualenv --python python3.6 $HOME/env/disco2labels

  • Activate the virtual environment: source $HOME/env/disco2labels/bin/activate

  • To install the dependencies: pip install -r requirements.txt

  • To install discodop, follow the instructions in its repository (discodop is used to evaluate the models and, optionally, for model selection).

  • To install the resources used in this paper (e.g. embeddings, PoS taggers or templates for training configurations), execute sh download.sh.

  • We also release a few pretrained parsing models; see the Pretrained parsing models section. (A consolidated sketch of the setup steps follows this list.)
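
For convenience, here is a consolidated sketch of the steps above, assuming the repository has already been cloned and you are inside it:

# Consolidated setup sketch; adjust paths to your machine.
virtualenv --python python3.6 $HOME/env/disco2labels
source $HOME/env/disco2labels/bin/activate
pip install -r requirements.txt
# Fetch embeddings, PoS taggers and training-config templates:
sh download.sh
# discodop is installed separately; follow the instructions in its repository.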

Encoding a treebank

NOTE: Currently, only the discbracket format is supported as the input format for the conversion.

cd disco2labels
python encode.py \
--train data/negra/train.discbracket \
--dev data/negra/dev.discbracket \
--test data/negra/test.discbracket \
--output data/negra_sl/pos-pointer/ \
--root_label \
--os \
--disc \
--split_char '{}' \
--disco_encoder pos-pointer \
--check_decode

The output will be three files, stored in the previously created directory data/negra_sl/pos-pointer/: train.tsv, dev.tsv and test.tsv. Each file has three columns: word, PoS tag and label. The same format is used to train and run the models.
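
For illustration, a few rows of such a .tsv file might look as follows (tab-separated; the label strings here are made up, and the actual ones depend on the chosen encoding and the {} split character):

Das	ART	1{}NP
Ergebnis	NN	2{}S
überzeugt	VVFIN	1{}S
nicht	PTKNEG	1{}S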

The available encoding strategies (--disco_encoder) are abs-idx|rel-idx|lehmer|lehmer-inverse|pos-pointer|pos-pointer-reduced. Check the paper for the specifics of each encoding.

To encode the treebank with the pos-pointer-reduced strategy, you also need to specify the parameter --path_reduced_tagset, i.e. --path_reduced_tagset resources/tagset_reduction_tiger_negra.txt (for the German NEGRA and TIGER treebanks) or --path_reduced_tagset resources/tagset_reduction_dptb.txt (for the English DPTB treebank).
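
For instance, re-running the NEGRA example above with the reduced tagset could look as follows (the output directory name is just a suggestion):

python encode.py \
--train data/negra/train.discbracket \
--dev data/negra/dev.discbracket \
--test data/negra/test.discbracket \
--output data/negra_sl/pos-pointer-reduced/ \
--root_label \
--os \
--disc \
--split_char '{}' \
--disco_encoder pos-pointer-reduced \
--path_reduced_tagset resources/tagset_reduction_tiger_negra.txt \
--check_decode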

To check all the parameter options: python encode.py --help

Decoding a linearized output

NOTE: For simplicity, we use a gold encoded file here, but the same applies to predicted output files generated by a model.

cd disco2labels
python decode.py \
--input data/negra_sl/pos-pointer/train.tsv \
--output /tmp/train_decoded.tsv \
--disc \
--disco_encoder pos-pointer \
--split_char {} \
--os 

To check all the parameter options: python decode.py --help
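
As a quick sanity check, the decoded trees can be compared against the gold treebank with discodop's eval command (a sketch; it assumes decode.py writes trees in discbracket format to the output file, despite its .tsv extension):

discodop eval data/negra/train.discbracket /tmp/train_decoded.tsv proper.prm --fmt=discbracket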

Training a model

NCRFpp

We used a modified version of the NCRFpp package that we include as a part of this repository:

cd disco2labels
python NCRF/main.py --config resources/ncrfpp_confs/train.negra.pos-pointer.bilstm.config

NOTE: To correctly train a model, please check the template at resources/ncrfpp_confs/train.negra.pos-pointer.bilstm.config and adapt the paths to the location of the data and resources on your machine.
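
The paths that typically need adapting look something like this (a hypothetical excerpt in NCRFpp's key=value format; the actual template may contain further options, and the embeddings path below is illustrative):

### I/O ###
train_dir=data/negra_sl/pos-pointer/train.tsv
dev_dir=data/negra_sl/pos-pointer/dev.tsv
test_dir=data/negra_sl/pos-pointer/test.tsv
model_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer
# Illustrative embeddings path; point this at the embeddings from download.sh:
word_emb_dir=resources/embeddings/german.sskip.vectors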

BERT/DistilBERT

We adapted a script originally released by Hugging Face 🤗 to train BERT-based models for discontinuous constituent parsing as sequence labeling.

cd disco2labels

DistilBERT

CUDA_VISIBLE_DEVICES=0 python run_token_classifier.py \
--data_dir  data/negra_sl/pos-pointer/ \
--transformer_model distilbert_model \
--transformer_pretrained_model  distilbert-base-german-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.distilbert-base-german-cased.model \
--output_dir /tmp/negra.pos-pointer.distilbert-base-german-cased.output \
--path_gold_parenthesized data/negra/dev.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--log /tmp/negra.pos-pointer.distilbert-base-german-cased.log \
--learning_rate 1e-5 \
--parsing_paradigm constituency --do_train --do_eval --num_train_epochs 45 --train_batch_size 6 --max_seq_length 240 

BERT

CUDA_VISIBLE_DEVICES=0 python run_token_classifier.py \
--data_dir  data/negra_sl/pos-pointer/ \
--transformer_model bert_model \
--transformer_pretrained_model bert-base-german-dbmdz-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.model \
--output_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.output \
--path_gold_parenthesized data/negra/dev.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--log /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.log \
--learning_rate 1e-5 \
--parsing_paradigm constituency --do_train --do_eval --num_train_epochs 45 --train_batch_size 6 --max_seq_length 240

Some relevant options:

  • --transformer_model: bert_model|distilbert_model
  • --transformer_pretrained_model: bert-base-german-dbmdz-cased|distilbert-base-german-cased (for German); bert-base-cased|bert-large-cased|distilbert-base-cased (for English)
  • --path_reduced_tagset: Required when training a model using the pos-pointer-reduced strategy

To check all the options: python run_token_classifier.py --help

On using uncased models

You will need to specify an uncased model (e.g. bert-base-german-dbmdz-uncased) and also set the --do_lower_case option.

BERT

CUDA_VISIBLE_DEVICES=0 python run_token_classifier.py \
--data_dir  data/negra_sl/pos-pointer/ \
--transformer_model bert_model \
--transformer_pretrained_model bert-base-german-dbmdz-uncased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-uncased.model \
--output_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-uncased.output \
--path_gold_parenthesized data/negra/dev.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--log /tmp/negra.pos-pointer.bert-base-german-dbmdz-uncased.log \
--learning_rate 1e-5 \
--parsing_paradigm constituency \
--do_train --do_eval --num_train_epochs 45 --train_batch_size 6 --max_seq_length 240 --do_lower_case

Running and evaluating a model

NCRFpp

taskset --cpu-list 1 \
python run_ncrfpp.py \
--test data/negra_sl/pos-pointer/test.tsv \
--gold data/negra/test.discbracket \
--model /tmp/ncrfpp.bilstm.negra.pos-pointer \
--gpu True \
--output /tmp/ncrfpp.bilstm.negra.pos-pointer \
--disco_encoder pos-pointer \
--evalb_param proper.prm \
--os \
--ncrfpp NCRF 

To check all the parameter options: python run_ncrfpp.py --help

Alternatively, if you simply want to run the model, you can create a decoding config file (check the template resources/ncrfpp_confs/decode.negra.pos-pointer.bilstm.config):

python NCRF/main.py --config resources/ncrfpp_confs/decode.negra.pos-pointer.bilstm.config
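
A decode config in this style would look roughly as follows (hypothetical paths, mirroring the training example above; compare the PoS-tagger decode config shown further below):

### Decode ###
status=decode
raw_dir=data/negra_sl/pos-pointer/test.tsv
decode_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer.output.tsv
dset_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer.dset
load_model_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer.model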

BERT/DistilBERT

We can use the same script we used for training the BERT-based models, but with the --do_test option instead.

cd disco2labels

DistilBERT

CUDA_VISIBLE_DEVICES=0 taskset --cpu-list 1 python run_token_classifier.py \
--data_dir data/negra_sl/pos-pointer/ \
--transformer_model distilbert_model \
--transformer_pretrained_model distilbert-base-german-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.distilbert-base-german-cased.model \
--output_dir /tmp/negra.pos-pointer.distilbert-base-german-cased \
--path_gold_parenthesized data/negra/test.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--parsing_paradigm constituency --do_test --eval_batch_size 8 --max_seq_length 240 

BERT

CUDA_VISIBLE_DEVICES=0 taskset --cpu-list 1 python run_token_classifier.py \
--data_dir data/negra_sl/pos-pointer/ \
--transformer_model bert_model \
--transformer_pretrained_model bert-base-german-dbmdz-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.model \
--output_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased \
--path_gold_parenthesized data/negra/test.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--parsing_paradigm constituency --do_test --eval_batch_size 8 --max_seq_length 240 

To check all the parameter options: python run_token_classifier.py --help

Predicted PoS tag setups

We also release the NCRFpp BILSTM PoS tagging models, so you can generate the same predicted PoS tags as ours for the training, development and test sets. These PoS taggers are part of the resources fetched by download.sh.

To run the PoS taggers, you just need to run the model with NCRFpp:

python NCRF/main.py --config resources/ncrf_confs_postaggers/decode.negra.pos.config

where the content of the config file would be something like:

### Decode ###
status=decode
raw_dir=data/postag_datasets/negra/train.tsv
decode_dir=/tmp/negra_train_predpostags.tsv
dset_dir=resources/ncrfpp_postaggers/negra.ncrfpp.sskip.postagger.dset
load_model_dir=resources/ncrfpp_postaggers/negra.ncrfpp.sskip.postagger.model

Here, data/postag_datasets/negra/train.tsv is a .tsv file with two columns: words and PoS tags (which act as the labels in this case).
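
For example (tab-separated; illustrative STTS tags):

Das	ART
Ergebnis	NN
überzeugt	VVFIN
nicht	PTKNEG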

Training sequence labeling parsing models with predicted PoS tags

To generate a new .discbracket file with predicted PoS tags, use the script scripts/discbracket_pred_postags.py:

python scripts/discbracket_pred_postags.py \
--input_disbracket data/negra/train.discbracket \
--input_pred_tags /tmp/negra_train_predpostags.tsv \
--out_disbracket data/negra_pred/train.discbracket

Follow the regular encoding and training process with the data/negra_pred/train.discbracket file, which now contains the predicted PoS tags. Repeat the process for every split of the treebank, as in the sketch below.
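
A sketch of the full predicted-PoS pipeline for one treebank (file names are illustrative, and the PoS-tagger config must be edited before each run, since raw_dir and decode_dir are fixed inside it):

for split in train dev test; do
    # 1. Tag the split with the released NCRFpp PoS tagger; set
    #    raw_dir=data/postag_datasets/negra/${split}.tsv and a matching
    #    decode_dir in the config beforehand.
    python NCRF/main.py --config resources/ncrf_confs_postaggers/decode.negra.pos.config
    # 2. Inject the predicted tags into the trees.
    python scripts/discbracket_pred_postags.py \
    --input_disbracket data/negra/${split}.discbracket \
    --input_pred_tags /tmp/negra_${split}_predpostags.tsv \
    --out_disbracket data/negra_pred/${split}.discbracket
done
# 3. Encode data/negra_pred/ as in "Encoding a treebank", then train as usual.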

Pretrained parsing models

To download the NEGRA, TIGER and DPTB NCRFpp BILSTM models trained with the pos-pointer encoding, click here.

To download the NEGRA, TIGER and DPTB BERT models trained with the pos-pointer encoding, click here.

TODO list

  • Add support for formats other than .discbracket.
  • Save extra parameters in BERT models for an easier/simpler way to load and run them later.
  • Improve the robustness of the word-to-subword-piece alignment for arbitrary BERT-based models (especially uncased ones).

References

David Vilares and Carlos GĂłmez-RodrĂ­guez. Discontinuous Constituent Parsing as Sequence Labeling. To appear at EMNLP 2020, Punta Cana, Dominican Republic (online due to COVID-19).

Acknowledgments

This work has received funding from the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150).