/pearley

A Probabilistic Earley Parser

Primary LanguageC++zlib LicenseZlib

The Earley Parser parse inputs from left to right sequentially by repeatedly applying the 3 steps: prediction, scanning and completion. For each input symbol, a list of Earley states is produced using these 3 steps and stored on the Earley chart. Each Earley state keeps track of the expansion of a particular production rule and the forward and inner probabilities associated with it, for more details, see Andrea Stolcke's technical report.

After a prediction step, the probabilty distribution of the next symbol can be computed from the list of current states by summing the forward probabilities of states that predicts the same terminal (symbol to the right of the dot) and normalising it over all possible terminals. This is implemented as EarleyParser::getNextWordTransitions(). The prefix probability of the inputs string can be computed by summing the forward probability of states produced by a scanning step.

The current implementation of the earley parser handles probabilistic context free grammar. It can also handle recursions in the grammar rules but not unit production rules (e.g.  A --> B; B--> A) or null productions rules, both of which cannot be produced by the ADIOS. It also does not yet handle robust parsing of ungrammatical input as it is unclear how prefix probabilisties can be derived from the resulting parial parses.



The main.cpp compiles a binary which demonstrate the usage of the Earley Parser.

Usage:
EarleyParser <rules_filename> <sentences_filename>

<rules_filename> is the name of the file containing the Probabilistic Context Free Grammars.  e.g.:

S _
1.0 S --> NP VP
1.0 NP --> Det N
1.0 VP --> VT NP
1.0 VP --> VI PP
1.0 PP --> P NP
1.0 Det --> a
1.0 N --> circle
1.0 N --> square
1.0 N --> triangle
1.0 VT --> touches
1.0 VI --> is
1.0 P --> above
1.0 P --> below



The first line of the file contains the symbol of the top level nonterminal (usually S) and the dummy state symbol, which is used by the parser internally, it is defined here to introduce additional clarity to the parser's output. Each rule consist of the weight of the production rule, the nonterminal that is to be expanded, an arrow "-->" and the list of symbols it expands to. The weights of production rules with the same nonterminal on the left hand side are normalised by the parser after loading.


<sentences_filename> is the name of the file containing sentences to be parsed, e.g.:

a circle touches a triangle



Using the two example files, gives the following output. Each state if follow by its forward and inner probabilities in [] and the probability of the viterbi parse tree in (). The viterbi parse tree is also produced at the end:

-----------------------------------------------------

0   -->.S    [1 1]      (1)

predicted
0 S -->.NP VP    [1 1]      (1)
0 NP -->.Det N    [1 1]     (1)
0 Det -->.a    [1 1]    (1)



a
-----------------------------------------------------

scanned
0 Det --> a.   [1 1]    (1)

completed
0 NP --> Det.N    [1 1]     (1)

predicted
1 N -->.circle    [0.333333 0.333333]       (0.333333)
1 N -->.square    [0.333333 0.333333]       (0.333333)
1 N -->.triangle    [0.333333 0.333333]     (0.333333)



circle
-----------------------------------------------------

scanned
1 N --> circle.   [0.333333 0.333333]       (0.333333)

completed
0 NP --> Det N.   [0.333333 0.333333]       (0.333333)
0 S --> NP.VP    [0.333333 0.333333]    (0.333333)

predicted
2 VP -->.VT NP    [0.166667 0.5]    (0.5)
2 VP -->.VI PP    [0.166667 0.5]    (0.5)
2 VT -->.touches    [0.166667 1]    (1)
2 VI -->.is    [0.166667 1]     (1)



touches
-----------------------------------------------------

scanned
2 VT --> touches.   [0.166667 1]    (1)

completed
2 VP --> VT.NP    [0.166667 0.5]    (0.5)

predicted
3 NP -->.Det N    [0.166667 1]      (1)
3 Det -->.a    [0.166667 1]     (1)



a
-----------------------------------------------------

scanned
3 Det --> a.   [0.166667 1]     (1)

completed
3 NP --> Det.N    [0.166667 1]      (1)

predicted
4 N -->.circle    [0.0555556 0.333333]      (0.333333)
4 N -->.square    [0.0555556 0.333333]      (0.333333)
4 N -->.triangle    [0.0555556 0.333333]    (0.333333)



triangle
-----------------------------------------------------

scanned
4 N --> triangle.   [0.0555556 0.333333]    (0.333333)

completed
3 NP --> Det N.   [0.0555556 0.333333]      (0.333333)
2 VP --> VT NP.   [0.0555556 0.166667]      (0.166667)
0 S --> NP VP.   [0.0555556 0.0555556]      (0.0555556)
0   --> S.   [0.0555556 0.0555556]      (0.0555556)


Finished parsing.

Computing parse tree from 3th state in the final set:   0 S --> NP VP.   [0.0555556 0.0555556]      (0.0555556)
0 --->
    1 ---> S
        2 ---> NP
            3 ---> Det
                4 ---> a
            5 ---> N
                6 ---> circle
        7 ---> VP
            8 ---> VT
                9 ---> touches
            10 ---> NP
                11 ---> Det
                    12 ---> a
                13 ---> N
                    14 ---> triangle



The binary compiled from the main.cpp, also outputs other less important informations.