A Python package for reproducing the results of the fully incremental dependency and constituency parsers described in:
- On The Challenges of Fully Incremental Neural Dependency Parsing at IJCNLP-AACL 2023.
- From Partial to Strictly Incremental Constituent Parsing at EACL 2024.
- Fully Incremental Parsing based on Neural Networks.
Note: Our implementation was built by forking yzhangcs' SuPar v1.1.4 repository. The Vector Quantization module was extracted from lucidrains' vector-quantize-pytorch repository, and the sequence-labeling encodings from Polifack's CoDeLin repository.
- Dependency Parsing:
  - Sequence Labeling (absolute, relative, PoS-based and bracketing encodings).
  - Transition-based with Arc-Eager.
- Constituency Parsing:
  - Sequence Labeling (absolute and relative encodings).
  - Attach-Juxtapose.
To reproduce our experiments, follow the installation and deployment steps of the SuPar, vector-quantize-pytorch and CoDeLin repositories. Supported functionalities are training, evaluation and prediction from CoNLL-U or PTB-bracketed files. We strongly suggest running our parsers through terminal commands to train models and generate prediction files. In the future we will expose SuPar methods to easily test our parsers' performance from the Python prompt.
Dependency Parsing:
- Sequence Labeling Dependency Parser (`SLDependencyParser`): inherits all arguments of the main class `Parser` and adds the flag `--codes` to select the encoding used to linearize the trees (`abs`, `rel`, `pos`, `1p`, `2p`).
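As a rough intuition for the head-selection encodings, here is a toy sketch (our own illustration, not the repository's code; the `pos`, `1p` and `2p` variants are not shown): in `abs` each word's label is the position of its head, while in `rel` it is the offset from the word to its head.

```python
# Toy illustration for the sentence "She reads books", where "reads"
# is the root. Heads are 1-indexed and 0 marks the root.
heads = [2, 0, 2]
abs_labels = list(heads)                                    # abs: head position itself
rel_labels = [h - i for i, h in enumerate(heads, start=1)]  # rel: head minus word position
print(abs_labels)  # [2, 0, 2]
print(rel_labels)  # [1, -2, -1]
```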
Experiment: train an absolute-encoding parser with mGPT as the encoder and an LSTM layer as the decoder to predict labels.
```sh
python3 -u -m supar.cmds.dep.sl train -b -c configs/config-mgpt.ini \
    -p ../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt \
    --codes abs --decoder lstm \
    --train ../treebanks/english-ewt/train.conllu \
    --dev ../treebanks/english-ewt/dev.conllu \
    --test ../treebanks/english-ewt/test.conllu
```
Model configuration (number and size of layers, optimization parameters, encoder selection) is specified through configuration files (see the `configs/` folder). We provide the main configurations used in our experiments.
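The shipped files in `configs/` are the authoritative source for the available options; purely as a hypothetical sketch, such a file groups encoder and optimization settings into INI sections, roughly along these lines (all option names below are made up for illustration):

```ini
; Hypothetical sketch -- see the real files in configs/ (e.g. config-mgpt.ini)
; for the actual section and option names.
[Network]
encoder = bert            ; pretrained-encoder mode
bert = ai-forever/mGPT    ; Hugging Face model identifier
[Optimizer]
lr = 5e-5                 ; learning rate
```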
- Transition-based Dependency Parser with Arc-Eager (`ArcEagerDependencyParser`): inherits the same arguments as the main class `Parser`.
Experiment: train an Arc-Eager parser using BLOOM-560M as the encoder and an MLP-based decoder to predict transitions, with delay (`--delay`) and Vector Quantization (`--use_vq`).
```sh
python3 -u -m supar.cmds.dep.eager train -b -c configs/config-bloom560.ini \
    -p ../results/models-dep/english-ewt/eager-bloom560-mlp/parser.pt \
    --decoder=mlp --delay=1 --use_vq \
    --train ../treebanks/english-ewt/train.conllu \
    --dev ../treebanks/english-ewt/dev.conllu \
    --test ../treebanks/english-ewt/test.conllu
```
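The `--delay` flag controls how many upcoming words the model may read before committing to a decision for the current word. A toy sketch of the idea (hypothetical, not the repository's implementation):

```python
# With delay = 1, the decision for word i can also look at word i + 1:
# each position's usable context grows by `delay` lookahead tokens.
tokens = ["She", "reads", "books"]
delay = 1
contexts = [tokens[: i + 1 + delay] for i in range(len(tokens))]
print(contexts[0])  # ['She', 'reads']
print(contexts[2])  # ['She', 'reads', 'books']
```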
This will save the following files in the folder `../results/models-dep/english-ewt/eager-bloom560-mlp`:
- `parser.pt`: trained PyTorch model.
- `metrics.pickle`: Python object with the evaluation on the test set.
- `pred.conllu`: parser predictions for the CoNLL-U test file.
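The `--use_vq` flag plugs in the vector-quantization module taken from lucidrains' vector-quantize-pytorch. Stripped of the learning machinery, its core operation is a nearest-neighbour lookup in a codebook; a minimal pure-Python sketch (not the actual module):

```python
# Replace a continuous vector with its nearest entry in a codebook
# (here a fixed toy codebook; in VQ the codebook is learned).
def quantize(vec, codebook):
    sq_dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))
    return idx, codebook[idx]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
idx, code = quantize([0.9, 0.8], codebook)
print(idx, code)  # 1 [1.0, 1.0]
```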
Constituency Parsing
- Sequence Labeling Constituency Parser (`SLConstituencyParser`): analogously to `SLDependencyParser`, it accepts the flag `--codes` to select the indexing to use (`abs`, `rel`).
```sh
python3 -u -m supar.cmds.const.sl train -b -c configs/config-mgpt.ini \
    -p ../results/models-con/ptb/abs-mgpt-lstm/parser.pt \
    --codes abs --decoder lstm \
    --train ../treebanks/ptb-gold/train.trees \
    --dev ../treebanks/ptb-gold/dev.trees \
    --test ../treebanks/ptb-gold/test.trees
```
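As a rough intuition for these constituency encodings (a toy sketch, not the repository's code): `abs` labels each word with the number of tree ancestors it shares with the next word, and `rel` with the change in that count.

```python
# Toy illustration for "(S (NP the dog) (VP barks))": for each word we
# store its path of nonterminals from the root, then count how many
# ancestors it shares with the following word.
paths = {"the": ["S", "NP"], "dog": ["S", "NP"], "barks": ["S", "VP"]}
words = ["the", "dog", "barks"]

def n_common(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

abs_labels = [n_common(paths[words[i]], paths[words[i + 1]])
              for i in range(len(words) - 1)]
rel_labels = [abs_labels[0]] + [abs_labels[i] - abs_labels[i - 1]
                                for i in range(1, len(abs_labels))]
print(abs_labels)  # [2, 1]
print(rel_labels)  # [2, -1]
```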
- Attach-Juxtapose Constituency Parser (`AttachJuxtaposeConstituencyParser`): extends the original SuPar implementation with the delay and Vector Quantization flags:
```sh
python3 -u -m supar.cmds.const.aj train -b -c configs/config-bloom560.ini \
    -p ../results/models-con/ptb/aj-bloom560-mlp/parser.pt \
    --delay=2 --use_vq \
    --train ../treebanks/ptb-gold/train.trees \
    --dev ../treebanks/ptb-gold/dev.trees \
    --test ../treebanks/ptb-gold/test.trees
```
Our code provides two evaluation methods for a trained PyTorch `.pt` model:
- Via the Python prompt, loading the model with the `.load()` method and evaluating with `.evaluate()`:
```python
>>> Parser.load('../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt').evaluate('../data/english-ewt/test.conllu')
```
- Via terminal commands, passing the model path with `-p`:
```sh
python -u -m supar.cmds.dep.sl evaluate \
    -p ../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt \
    --data ../data/english-ewt/test.conllu
```
The prediction step can also be executed from the Python prompt or with terminal commands to generate a CoNLL-U file:
- From the Python prompt with the `.predict()` method:
```python
>>> Parser.load('../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt') \
...     .predict(data='../data/english-ewt/test.conllu',
...              pred='../results/models-dep/english-ewt/abs-mgpt-lstm/pred.conllu')
```
- Via terminal commands:
```sh
python -u -m supar.cmds.dep.sl predict \
    -p ../results/models-dep/english-ewt/abs-mgpt-lstm/parser.pt \
    --data ../data/english-ewt/test.conllu \
    --pred ../results/models-dep/english-ewt/abs-mgpt-lstm/pred.conllu
```
This work has been funded by the European Research Council (ERC), under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), Cátedra CICAS (Sngular, University of A Coruña), and Centro de Investigación de Galicia "CITIC".
```bib
@thesis{ezquerro-2023-syntactic,
    title = {{Análisis sintáctico totalmente incremental basado en redes neuronales}},
    author = {Ezquerro, Ana and Gómez-Rodríguez, Carlos and Vilares, David},
    institution = {University of A Coruña},
    year = {2023},
    url = {https://ruc.udc.es/dspace/handle/2183/33269}
}

@inproceedings{ezquerro-2023-challenges,
    title = {{On the Challenges of Fully Incremental Neural Dependency Parsing}},
    author = {Ezquerro, Ana and Gómez-Rodríguez, Carlos and Vilares, David},
    booktitle = {Proceedings of IJCNLP-AACL 2023},
    year = {2023}
}
```