📝 IncPar: Fully Incremental Neural Dependency and Constituency Parsing

A Python package for reproducing the results of the fully incremental dependency and constituency parsers described in the publications listed under Citation below.

Note: Our implementation was built by forking yzhangcs' SuPar v1.1.4 repository. The Vector Quantization module was extracted from lucidrains' vector-quantize-pytorch repository, and the sequence-labeling encodings from Polifack's CoDeLin repository.

Incremental Parsers

Usage

To reproduce our experiments, follow the installation and deployment steps of the SuPar, vector-quantize-pytorch and CoDeLin repositories. Supported functionalities are training, evaluation and prediction from CoNLL-U or PTB-bracketed files. We strongly suggest running our parsers from terminal commands in order to train models and generate prediction files. In the future 🙌 we'll make SuPar methods available to easily test our parsers' performance from the Python prompt.

Training

Dependency Parsing:

  • Sequence-labeling Dependency Parser (SLDependencyParser): inherits all arguments of the main Parser class and adds the flag --codes to select the encoding used to linearize trees (abs, rel, pos, 1p, 2p).
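As a rough illustration of these linearizations (a stand-in sketch, not IncPar's actual implementation), the abs encoding labels each word with the absolute index of its head, while rel labels it with the offset from the word's position to its head; the function name and signature below are illustrative only:

```python
# Stand-in sketch of two head-index linearizations used in sequence-labeling
# dependency parsing; `encode_heads` is illustrative, not IncPar's API.
def encode_heads(heads, codes="abs"):
    """heads[i] is the 1-based head position of word i+1 (0 = root)."""
    if codes == "abs":
        return list(heads)                                  # absolute head index
    if codes == "rel":
        return [h - (i + 1) for i, h in enumerate(heads)]   # offset to the head
    raise ValueError(f"unsupported encoding: {codes}")

# "The cat sleeps" with heads [2, 3, 0] (0 = root):
print(encode_heads([2, 3, 0], "abs"))  # [2, 3, 0]
print(encode_heads([2, 3, 0], "rel"))  # [1, 1, -3]
```

The remaining encodings work on different principles (roughly, pos is relative to PoS tags, and 1p/2p are bracketing-based), so they are not sketched here.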

Experiment: Train an absolute-encoding parser with mGPT as the encoder and an LSTM layer as the decoder to predict labels.

python3 -u -m supar.cmds.dep.sl train -b -c configs/config-mgpt.ini \
    -p ../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt \
    --codes abs --decoder lstm \
    --train ../treebanks/english-ewt/train.conllu \
    --dev ../treebanks/english-ewt/dev.conllu \
    --test ../treebanks/english-ewt/test.conllu

Model configuration (number and size of layers, optimization parameters, encoder selection) is specified through configuration files (see the configs/ folder). We provide the main configurations used in our experiments.

Experiment: Train an Arc-Eager parser using BLOOM-560M as the encoder and an MLP-based decoder to predict transitions, with delay $k=1$ (--delay) and Vector Quantization (--use_vq).

python3 -u -m supar.cmds.dep.eager train -b -c configs/config-bloom560.ini \
    -p ../results/models-dep/english-ewt/eager-bloom560-mlp/parser.pt \
    --decoder=mlp --delay=1 --use_vq \
    --train ../treebanks/english-ewt/train.conllu \
    --dev ../treebanks/english-ewt/dev.conllu \
    --test ../treebanks/english-ewt/test.conllu
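To make the effect of --delay concrete, here is a toy sketch (assumed for illustration; not IncPar's actual code) of a delay of k: the prediction for word i may additionally consume the representations of the next k words, concatenated to word i's own, with zero padding at the end of the sentence:

```python
def delayed(states, k):
    """states: list of per-word feature vectors; returns a list where position i
    holds word i's vector concatenated with the next k vectors (zero-padded)."""
    dim = len(states[0])
    padded = states + [[0.0] * dim] * k          # lookahead padding for the last words
    return [sum((padded[i + j] for j in range(k + 1)), [])
            for i in range(len(states))]

h = [[1, 2], [3, 4], [5, 6]]                     # toy 2-dimensional word states
print(delayed(h, 1))  # [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 0.0, 0.0]]
```

A delay of k trades a small loss of strict incrementality for lookahead context, which is why it is exposed as a tunable flag.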

This will save the following files in the folder ../results/models-dep/english-ewt/eager-bloom560-mlp:

  1. parser.pt: the trained PyTorch model.
  2. metrics.pickle: a pickled Python object with the evaluation on the test set.
  3. pred.conllu: the parser's prediction for the CoNLL-U test file.
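The saved metrics can be inspected afterwards with Python's pickle module. A minimal sketch (the dictionary below is only a stand-in; the real metrics.pickle stores a SuPar metric object whose exact structure we don't reproduce here):

```python
import pickle

# Stand-in object; the real metrics.pickle holds the parser's test-set metrics.
metrics = {"UAS": 0.0, "LAS": 0.0}
with open("metrics.pickle", "wb") as f:
    pickle.dump(metrics, f)

# Loading works the same way for the file written by the parser.
with open("metrics.pickle", "rb") as f:
    loaded = pickle.load(f)
print(loaded)  # {'UAS': 0.0, 'LAS': 0.0}
```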

Constituency Parsing

Constituency parsers are trained analogously, e.g. a sequence-labeling parser (supar.cmds.const.sl) and an attach-juxtapose parser (supar.cmds.const.aj):

python3 -u -m supar.cmds.const.sl train -b -c configs/config-mgpt.ini \
    -p ../results/models-con/ptb/abs-mgpt-lstm/parser.pt \
    --codes abs --decoder lstm \
    --train ../treebanks/ptb-gold/train.trees \
    --dev ../treebanks/ptb-gold/dev.trees \
    --test ../treebanks/ptb-gold/test.trees

python3 -u -m supar.cmds.const.aj train -b -c configs/config-bloom560.ini \
    -p ../results/models-con/ptb/aj-bloom560-mlp/parser.pt \
    --delay=2 --use_vq \
    --train ../treebanks/ptb-gold/train.trees \
    --dev ../treebanks/ptb-gold/dev.trees \
    --test ../treebanks/ptb-gold/test.trees

Evaluation

Our code provides two evaluation methods for a trained .pt PyTorch model:

  1. Via the Python prompt, loading the model with the .load() method and evaluating with .evaluate():
>>> Parser.load('../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt').evaluate('../data/english-ewt/test.conllu')
  2. Via terminal commands:
python -u -m supar.cmds.dep.sl evaluate -p ../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt --data ../data/english-ewt/test.conllu

Prediction

The prediction step can also be executed from the Python prompt or terminal commands to generate a CoNLL-U file:

  1. From the Python prompt, with the .predict() method:
>>> Parser.load('../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt') \
...     .predict(data='../data/english-ewt/test.conllu',
...              pred='../results/models-dep/english-ewt/abs-mgpt-lstm/pred.conllu')
  2. Via terminal commands:
python -u -m supar.cmds.dep.sl predict -p ../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt \
    --data ../data/english-ewt/test.conllu \
    --pred ../results/models-dep/english-ewt/abs-mgpt-lstm/pred.conllu

Acknowledgments

This work has been funded by the European Research Council (ERC), under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), CĂĄtedra CICAS (Sngular, University of A Coruña), and Centro de InvestigaciĂłn de Galicia ‘‘CITIC’’.

Citation

@thesis{ezquerro-2023-syntactic,
  title     = {{AnĂĄlisis sintĂĄctico totalmente incremental basado en redes neuronales}},
  author    = {Ezquerro, Ana and GĂłmez-RodrĂ­guez, Carlos and Vilares, David},
  institution = {University of A Coruña},
  year      = {2023},
  url       = {https://ruc.udc.es/dspace/handle/2183/33269}
}

@inproceedings{ezquerro-2023-challenges,
  title     = {{On the Challenges of Fully Incremental Neural Dependency Parsing}},
  author    = {Ezquerro, Ana and GĂłmez-RodrĂ­guez, Carlos and Vilares, David},
  booktitle = {Proceedings of IJCNLP-AACL 2023},
  year      = {2023}
}