/TreebankParser

Parser for treebanks based on Penn Treebank type of encoding that generates Probabilistic Context Free Grammars

Primary LanguageCApache License 2.0Apache-2.0

TreebankParser

(C) 2016-2018 by Damir Cavar <dcavar@iu.edu>

This code and the binaries are made available under the Apache License, Version 2.0, January 2004. For details see the included LICENSE.txt file.

This is a tool that reads treebank files and generates a probabilistic grammar for use in FLE.

Currently it can generate all Context-free Grammar rules from a treebank in the Penn-treebank format.

Take for example the test1.txt file in the current source repository. You can run treebankparser to generate a frequency profile of the rules:

./treebankparser -y S test1.txt

The -y S parameter generates an S-symbol for empty root nodes, as in test1.txt. The default is to generate ROOT as the label for such root nodes.

The out put should look like this:

1	ADJP --> JJ
1	IP-HLN --> VP
1	JJ --> 重要
1	NN --> 企业
1	NN --> 增长点
1	NN --> 外商
1	NN --> 外贸
1	NN --> 投资
2	NP --> NN
1	NP --> NP
1	NP-OBJ --> NP
1	NP-PN --> NR
1	NP-SBJ --> NN NN NN
1	NR --> **
1	S --> IP-HLN
1	VP --> NP-OBJ
1	VV --> 成为

The probability is tab-delimited from the rule. It can also be generated as a float using the -r parameter:

./treebankparser -r -y S test1.txt > res.log

The output should look like:

0.0555556       ADJP --> JJ
0.0555556       IP-HLN --> VP
0.0555556       JJ --> 重要
0.0555556       NN --> 企业
0.0555556       NN --> 增长点
0.0555556       NN --> 外商
0.0555556       NN --> 外贸
0.0555556       NN --> 投资
0.111111        NP --> NN
0.0555556       NP --> NP
0.0555556       NP-OBJ --> NP
0.0555556       NP-PN --> NR
0.0555556       NP-SBJ --> NN NN NN
0.0555556       NR --> **
0.0555556       S --> IP-HLN
0.0555556       VP --> NP-OBJ
0.0555556       VV --> 成为

The rules are printed to standard out with absolute or relative frequencies.

I am adding more features, e.g.:

  • reloading existing grammars (multi-batch cycles for larger corpus collections)
  • elimination of terminal rules
  • parsing alternative coding formats for syntactic trees or treebanks (e.g. XML, TEI XML)
  • output probabilities for Left-hand-side symbols only, rather than rules
  • generation of a Weighted Finite State Transducer representation, as coded in FLE

If you have ideas or suggestions, let me know.

Prerequisites

The tool is written in C++11 and requires the following libraries:

Compile

Use CLion or otherwise run:

cmake CMakeLists.txt
make