
RST Parser

If you are looking for an RST parser that is ready to use, please check out this repository

If you need a framework to develop your own RST parser, please keep reading :-)

Basic Description

RST parser for document-level discourse parsing. The parsing algorithm is shift-reduce parsing, and the parsing model is an offline-trained multi-class classifier.

To obtain good performance, you can:

  • add more features to the feature generator (in feature.py)
  • tune the parameters of the parsing model (in model.py). For now, I simply use LinearSVC with its default parameter settings (see the sketch after this list).
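As a hedged example, swapping in a tuned classifier could look roughly like the snippet below. The class skeleton and the attribute name classifier are assumptions about model.py's internals, not the actual code:

    # Hypothetical sketch of tuning the classifier inside model.py.
    # The class skeleton and the attribute name `classifier` are assumptions.
    from sklearn.svm import LinearSVC

    class ParsingModel(object):
        def __init__(self, C=1.0):
            # Sparse L1-regularized linear SVM; C trades margin width against training error.
            self.classifier = LinearSVC(C=C, penalty='l1', dual=False)

        def train(self, trnM, trnL):
            # trnM: feature matrix, trnL: parsing-action labels (see the data module)
            self.classifier.fit(trnM, trnL)

        def predict(self, features):
            # `features` must be vectorized with the same vocab used during training
            return self.classifier.predict(features)[0]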

Demo

Start from main.py for a demo.
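The demo boils down to something like the following sketch. The ParsingModel constructor arguments and the assumption that a trained model is already available (sr_parse needs one) are mine, so check main.py for the actual wiring:

    # Hypothetical sketch of a parsing run; main.py may differ in its details.
    from model import ParsingModel

    edus = ['The dollar weakened against most currencies,',
            'while stocks slipped on profit-taking.']

    pm = ParsingModel()          # constructor args are an assumption
    # ... load or train the action classifier here (see the model/data modules) ...
    tree = pm.sr_parse(edus)     # every list element is treated as one EDU
    print(tree.bracketing())     # assumes sr_parse returns an RSTTree-like object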

Modules

  • tree: every operation on an RST tree lives in this module. For example:
    • Build a general/binary RST tree from an annotated file
    • Binarize a general RST tree (the original trees in the RST treebank are not necessarily binary)
    • Generate the bracketing sequence for evaluation
    • Write an RST tree into a file (not implemented yet)
    • Generate shift-reduce parsing action examples
    • Get all EDUs from the RST tree
  • parser: an implementation of the shift-reduce parsing algorithm (sketched after this list), including the following functions:
    • Initialize the parsing status given a sequence of texts
    • Change the status according to a specific parsing action
    • Get the status of the stack/queue
    • Check whether parsing should stop
  • model: the parsing model module, where a trained parsing model predicts parsing actions. This module includes:
    • Batch training on the data generated by the data module
    • Predict parsing actions for a given feature set
    • Save/load the parsing model
  • feature: a feature generator, which generates features from the current stack/queue status
  • data: generate training data for offline training
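To make the interplay between parser, model and feature concrete, the shift-reduce loop can be pictured roughly as below. The class names come from the Main Classes section, but endparsing(), getstatus(), and the FeatureGenerator constructor arguments are placeholder assumptions, not the actual API:

    # Hypothetical shift-reduce loop tying the modules together.
    # endparsing()/getstatus() are placeholder names for the real stop-check
    # and stack/queue accessors in the parser module.
    from parser import SRParser
    from feature import FeatureGenerator

    def parse(pm, edu_texts):
        # pm is a trained ParsingModel (see the model module)
        sr = SRParser()
        sr.init(edu_texts)                        # queue holds one item per EDU
        while not sr.endparsing():                # placeholder stop check
            stack, queue = sr.getstatus()         # placeholder accessors
            fg = FeatureGenerator(stack, queue)   # constructor args are an assumption
            action = pm.predict(fg.features())    # e.g. ('Shift', None, None)
            sr.operate(action)
        return sr.getparsetree()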

Main Classes

(For all of the following functions, please refer to the code for further explanation; a rough end-to-end sketch follows the list.)

  • RSTTree (in tree module):
    • build(): Build a binary RST tree from an annotated discourse file
    • generate_sample(): Generate a sequence of parsing actions and the corresponding training examples, which can be used for offline training of the parsing model
    • getedutext(): Get the sequence of EDU texts from the given RST tree
    • bracketing(): Generate the bracketing sequence for evaluation
  • SRParser (in parser module):
    • init(texts): Initialize the queue status from the given text sequence. Each element in this sequence will be treated as an EDU
    • operate(action_tuple): Change the queue/stack according to the action tuple, for example, the operation (Shift, None, None) will move one element from the head of the queue to the top of the stack
    • getparsetree(): Return the entire RST tree
  • FeatureGenerator (in feature module):
    • features(): the main generator, which extracts all the necessary features from the current queue/stack. You can extend this generator by calling other sub-functions within it.
  • ParsingModel (in model module):
    • train(trnM, trnL): Offline training of the parsing model (i.e., a multi-class classifier) from the given training data trnM and the corresponding labels trnL
    • predict(features): Predict a parsing action according to the given feature generator
    • sr_parse(texts): Perform shift-reduce RST parsing on the given text sequence. Each element in this sequence will be treated as an EDU
  • Data (in data module):
    • buildvocab(thresh): Build the feature vocab by removing low-frequency features. The same vocab is also used for parsing in the test stage.
    • buildmatrix(): Build data matrix for offline training
    • savematrix(fname): Save data matrix and corresponding labels into fname
    • getvocab(): Get feature vocab
    • savevocab(fname): Save feature vocab and relation mapping (from relations to indices) into fname
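Putting the main classes together, an offline training run could look roughly like the sketch below. The constructor arguments, the file names, and the assumption that savematrix() pickles a (matrix, labels) pair are all hypothetical; only the method names come from the descriptions above:

    # Hypothetical offline-training pipeline built from the classes above.
    import pickle
    from tree import RSTTree
    from data import Data
    from model import ParsingModel

    # 1. Turn each annotated tree into shift-reduce training examples.
    samples = []
    for fname in ['wsj_0603.out.dis']:       # hypothetical treebank file
        t = RSTTree(fname)                   # constructor args are an assumption
        t.build()                            # binary RST tree from the annotation
        samples += t.generate_sample()       # (features, action) pairs (assumed format)

    # 2. Build the feature vocab and the training matrix, then save both.
    data = Data(samples)                     # constructor args are an assumption
    data.buildvocab(thresh=1)                # frequency threshold for pruning rare features
    data.buildmatrix()
    data.savematrix('trn-data.pickle')       # hypothetical file name
    data.savevocab('vocab.pickle')           # hypothetical file name

    # 3. Train the action classifier on the saved matrix and labels.
    with open('trn-data.pickle', 'rb') as fin:
        trnM, trnL = pickle.load(fin)        # assumes savematrix pickles (matrix, labels)
    pm = ParsingModel()
    pm.train(trnM, trnL)
    # pm can now drive sr_parse(), as in the Demo section above.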

Reference