
SyntaxDot

Introduction

SyntaxDot is a sequence labeler and dependency parser that uses Transformer networks. SyntaxDot models can be trained from scratch or fine-tuned from pretrained models such as BERT or XLM-RoBERTa.

In principle, SyntaxDot can be used to perform any sequence labeling task, but so far the focus has been on:

  • Part-of-speech tagging
  • Morphological tagging
  • Topological field tagging
  • Lemmatization
  • Named entity recognition

The easiest way to get started with SyntaxDot is to use a pretrained sticker2 model (SyntaxDot is currently compatible with sticker2 models).

Features

  • Input representations:
    • Word pieces
    • Sentence pieces
  • Flexible sequence encoder/decoder architecture, which supports:
    • Simple sequence labels (e.g. POS, morphology, named entities)
    • Lemmatization, based on edit trees (see the sketch after this list)
    • Simple API to extend to other tasks
    • Dependency parsing as sequence labeling
  • Dependency parsing using deep biaffine attention and MST decoding (a scoring sketch follows this list).
  • Multi-task training and classification using scalar weighting.
  • Encoder models:
    • Transformers
    • Finetuning of BERT, XLM-RoBERTa, ALBERT, and SqueezeBERT models
  • Model distillation
  • Deployment:
    • Standalone binary that links against PyTorch's libtorch
    • Very liberal license
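
The edit-tree bullet above can be made concrete with a small example. The sketch below is a simplified, flat variant of edit-tree lemmatization: it derives a prefix/suffix rewrite rule around the longest common substring of a (form, lemma) training pair and applies it to unseen forms. The real edit trees used by SyntaxDot are recursive; the names and structure here are illustrative only.

```rust
/// Longest common substring of `a` and `b`, returned as
/// (start in a, start in b, length). O(|a|·|b|) dynamic programming.
fn longest_common_substring(a: &[char], b: &[char]) -> (usize, usize, usize) {
    let mut best = (0, 0, 0);
    let mut dp = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            if a[i - 1] == b[j - 1] {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                if dp[i][j] > best.2 {
                    best = (i - dp[i][j], j - dp[i][j], dp[i][j]);
                }
            }
        }
    }
    best
}

/// A flat edit rule: strip the form's prefix/suffix around the longest
/// common substring and replace them with the lemma's prefix/suffix.
#[derive(Debug)]
struct EditRule {
    strip_prefix: usize,
    strip_suffix: usize,
    add_prefix: String,
    add_suffix: String,
}

impl EditRule {
    fn from_pair(form: &str, lemma: &str) -> Self {
        let form: Vec<char> = form.chars().collect();
        let lemma: Vec<char> = lemma.chars().collect();
        let (fs, ls, len) = longest_common_substring(&form, &lemma);
        EditRule {
            strip_prefix: fs,
            strip_suffix: form.len() - fs - len,
            add_prefix: lemma[..ls].iter().collect(),
            add_suffix: lemma[ls + len..].iter().collect(),
        }
    }

    fn apply(&self, form: &str) -> Option<String> {
        let chars: Vec<char> = form.chars().collect();
        if chars.len() < self.strip_prefix + self.strip_suffix {
            return None;
        }
        let stem: String = chars[self.strip_prefix..chars.len() - self.strip_suffix]
            .iter()
            .collect();
        Some(format!("{}{}{}", self.add_prefix, stem, self.add_suffix))
    }
}

fn main() {
    // "gesagt" -> "sagen": strip "ge" and "t", append "en".
    let rule = EditRule::from_pair("gesagt", "sagen");
    assert_eq!(rule.apply("gesagt").as_deref(), Some("sagen"));
    // The same rule generalizes to other ge-...-t participles.
    assert_eq!(rule.apply("gemacht").as_deref(), Some("machen"));
}
```

Treating such rules as class labels turns lemmatization into an ordinary sequence labeling task, which is how it fits the encoder/decoder architecture described above.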
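
Similarly, the deep biaffine attention bullet refers to a Dozat & Manning-style arc scorer. The sketch below shows only the scoring function, on plain slices rather than libtorch tensors; the function name, dimensions, and the assumption that head/dependent representations come from separate MLPs are illustrative, and MST decoding over the resulting score matrix is not shown.

```rust
/// Biaffine arc score for one (dependent, head) pair:
/// score = dep^T · U · head + u^T · head + b.
fn biaffine_arc_score(
    dep: &[f64],     // dependent representation (after its MLP)
    head: &[f64],    // head candidate representation (after its MLP)
    u: &[Vec<f64>],  // bilinear weight matrix U, |dep| x |head|
    u_head: &[f64],  // linear weights over the head representation
    bias: f64,       // scalar bias
) -> f64 {
    let bilinear: f64 = u
        .iter()
        .zip(dep)
        .map(|(row, &d)| d * row.iter().zip(head).map(|(w, &h)| w * h).sum::<f64>())
        .sum();
    let linear: f64 = u_head.iter().zip(head).map(|(w, &h)| w * h).sum();
    bilinear + linear + bias
}

fn main() {
    // Toy 2-dimensional representations: one dependent, two head candidates.
    let dep = vec![1.0, 0.5];
    let heads = vec![vec![0.2, 1.0], vec![1.0, -0.3]];
    let u = vec![vec![0.5, 0.1], vec![0.0, 0.3]];
    let u_head = vec![0.1, 0.2];

    // The parser scores every head candidate for every token and then runs
    // MST decoding over the score matrix to obtain a well-formed tree.
    for (j, head) in heads.iter().enumerate() {
        println!("head {}: {:.3}", j, biaffine_arc_score(&dep, head, &u, &u_head, 0.0));
    }
}
```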

Documentation


Issues

You can report bugs and feature requests in the SyntaxDot issue tracker.

License

For licensing information, see COPYRIGHT.md.