Trex-Parser is a minimalist dependency parser loosely based on the model described in the 2012 paper by Bohnet and Nivre. The features are one-hot vectors of words and POS tags in the stack and buffer. At each time step, the parser chooses the transition (LeftArc, RightArc, or Shift) that is assigned the highest joint probability by two models: a multiclass perceptron and an arc model.
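The exact transition system is not spelled out above; as a point of reference, a minimal arc-standard system with these three transitions can be sketched as follows (the function names and toy sentence are illustrative, not the parser's actual code):

```python
# Minimal arc-standard sketch of the three transitions; names and the
# toy sentence are illustrative.

def shift(stack, buffer, arcs):
    """Move the front of the buffer onto the stack."""
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs):
    """Make the second-topmost stack item a child of the topmost one."""
    child = stack.pop(-2)
    arcs.append((stack[-1], child))  # (head, child)

def right_arc(stack, buffer, arcs):
    """Make the topmost stack item a child of the second-topmost one."""
    child = stack.pop()
    arcs.append((stack[-1], child))

# Parse the toy sentence "ROOT saw her" (token indices 0, 1, 2).
stack, buffer, arcs = [0], [1, 2], []
shift(stack, buffer, arcs)      # stack: [0, 1]
shift(stack, buffer, arcs)      # stack: [0, 1, 2]
right_arc(stack, buffer, arcs)  # "saw" -> "her"
right_arc(stack, buffer, arcs)  # ROOT -> "saw"
print(arcs)  # [(1, 2), (0, 1)]
```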
The parser is tuned on the English and German datasets of the CoNLL 2006 shared task.
The arc model assigns a probability to the transitions LeftArc and RightArc conditioned on the POS tags of the top two elements of the stack. In other words, it estimates P(a | h, c): the probability of arc a, given head h and child c. For convenience, the transition Shift is assigned probability 1.
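The text does not say how these conditional probabilities are estimated; one plausible reading is relative frequencies of transitions per (head POS, child POS) pair observed in the training transitions. A sketch under that assumption (all names are illustrative, not the parser's actual API):

```python
from collections import Counter, defaultdict

# Sketch of the arc model as relative frequencies of transitions per
# (head POS, child POS) pair; the estimation method is an assumption.

def train_arc_model(observations):
    """observations: iterable of (transition, head_pos, child_pos)."""
    counts = defaultdict(Counter)
    for transition, head_pos, child_pos in observations:
        counts[(head_pos, child_pos)][transition] += 1
    return {pair: {t: n / sum(ctr.values()) for t, n in ctr.items()}
            for pair, ctr in counts.items()}

def arc_probability(model, transition, head_pos, child_pos):
    if transition == "Shift":
        return 1.0  # Shift gets probability 1 by convention
    # unseen POS pairs fall back to probability 0
    return model.get((head_pos, child_pos), {}).get(transition, 0.0)

obs = [("LeftArc", "VERB", "NOUN"),
       ("RightArc", "VERB", "NOUN"),
       ("RightArc", "VERB", "NOUN")]
model = train_arc_model(obs)
```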
The multiclass perceptron assigns a probability to each of the three transitions, given the full feature model.
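A multiclass perceptron natively produces scores rather than probabilities; a common way to obtain probabilities is a softmax over the per-transition scores. The sketch below assumes that scheme and operates on sparse (active-index) features; every name here is illustrative:

```python
import math

# Sketch of a multiclass perceptron over sparse features; the softmax
# step is an assumption, since the write-up does not say how perceptron
# scores become probabilities.

TRANSITIONS = ["LeftArc", "RightArc", "Shift"]
DIM = 1_000  # toy feature-space size
W = {t: [0.0] * DIM for t in TRANSITIONS}

def score(transition, active):
    # dot product with a one-hot vector = sum over the active indices
    return sum(W[transition][i] for i in active)

def predict(active):
    return max(TRANSITIONS, key=lambda t: score(t, active))

def probabilities(active):
    scores = [score(t, active) for t in TRANSITIONS]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {t: e / z for t, e in zip(TRANSITIONS, exps)}

def update(active, gold):
    """Standard perceptron step: reward the gold class, punish the error."""
    pred = predict(active)
    if pred != gold:
        for i in active:
            W[gold][i] += 1.0
            W[pred][i] -= 1.0

update([3, 41, 250], "Shift")
print(predict([3, 41, 250]))  # Shift
```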
Features are one-hot representations of the top ws elements in the stack and buffer: the word form, lemma, and POS tag of each element. The number of elements ws is a hyperparameter (the window size). The vectors are concatenated, resulting in a high-dimensional feature vector (681,145 and 1,137,613 dimensions for English and German respectively at ws = 12).
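The extraction step can be sketched as follows, assuming a dict-per-token representation and an incrementally grown feature index (both assumptions, not the parser's actual data structures):

```python
# Sketch of sparse feature extraction: each (slot, attribute, value)
# triple is mapped to a dimension, and a configuration is represented by
# the list of active dimensions instead of a concatenated one-hot vector.
# The token representation and all names are assumptions.

feature_index = {}  # (slot, attribute, value) -> dimension

def feature_id(slot, attribute, value, frozen=False):
    key = (slot, attribute, value)
    if key not in feature_index:
        if frozen:
            return None  # unseen feature at inference time
        feature_index[key] = len(feature_index)
    return feature_index[key]

def extract(stack, buffer, tokens, ws=2, frozen=False):
    """Active dimensions for the top `ws` stack and buffer items."""
    slots = [("s", i, t) for i, t in enumerate(reversed(stack[-ws:]))]
    slots += [("b", i, t) for i, t in enumerate(buffer[:ws])]
    active = []
    for side, i, tok in slots:
        for attr in ("form", "lemma", "pos"):
            fid = feature_id((side, i), attr, tokens[tok][attr], frozen)
            if fid is not None:
                active.append(fid)
    return active

tokens = [{"form": "ROOT", "lemma": "ROOT", "pos": "ROOT"},
          {"form": "saw", "lemma": "see", "pos": "VERB"}]
print(extract([0], [1], tokens))  # [0, 1, 2, 3, 4, 5]
```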
However, only three high-dimensional vectors are ever stored: the weight vectors of the perceptron. Feature vectors are instead represented by the indices at which the one-hot vector has value 1. This keeps the parser feasibly fast: average inference time per sentence was 0.253 s for German and 0.550 s for English (with b1 = 25 and ws = 12).
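A quick way to see the gap between a dense one-hot vector at the reported English dimensionality and the index list that replaces it (the particular indices are arbitrary examples):

```python
import sys

# Dense one-hot vector versus the same vector stored as active indices.

DIM = 681_145  # English feature-space size at ws = 12

dense = [0] * DIM
for i in (12, 4077, 90_210):
    dense[i] = 1

sparse = [12, 4077, 90_210]  # same information: indices with value 1

print(sys.getsizeof(dense) > 1000 * sys.getsizeof(sparse))  # True
```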
- epochs - number of training epochs
- ws - window size, number of top elements in stack and buffer used for feature extraction
- b1 - beam size
- alpha - skip arcs that are assigned probabilities lower than this value (the default 0 only allows arcs for POS pairs seen in the training set)
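The hyperparameters above could be gathered into a configuration object along these lines (a hypothetical container; the parser's actual interface and defaults are not documented here, so every default below except alpha = 0 is an assumption):

```python
from dataclasses import dataclass

# Hypothetical hyperparameter container; defaults other than alpha
# are assumptions.

@dataclass
class Hyperparams:
    epochs: int = 10     # number of training epochs
    ws: int = 12         # window size for feature extraction
    b1: int = 25         # beam size
    alpha: float = 0.0   # minimum arc probability; 0 keeps only seen POS pairs

hp = Hyperparams(epochs=5)
```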
The model seems to have a bug: it attained a UAS of 0.935 for English on the development set, but the unlabeled attachment score on the test set was below 0.75.