/pdtb-parser

A PDTB-Styled End-to-End Discourse Parser

Java End-to-End PDTB-Styled Discourse Parser

PDTB parser based on:

Ziheng Lin, Hwee Tou Ng and Min-Yen Kan (2014). A PDTB-Styled End-to-End Discourse Parser. Natural Language Engineering, 20, pp 151-184. Cambridge University Press.

Developer: Ilija Ilievski
Version: 2.0.2
Last update: 7-Nov-2015

Requires Java 1.7+. Tested only on macOS and Linux.

Usage

  1. Download the parser from here.
  2. Extract the file with:
    tar -xzvf pdtb-parser.tar.gz
  3. From the extracted pdtb-parser folder run:
    java -jar parser.jar examples/wsj_2300.txt

Replace the argument examples/wsj_2300.txt with the file, or the folder containing the text files, that you want to parse. The resulting pipe and auxiliary files are written to a folder named output inside each folder that contains text files. Note that when the argument is a folder, the parser searches for files ending in .txt in that folder and all of its subfolders.
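To preview which files a folder argument would cover, the search rule described above can be sketched in Python. This is only an illustration of the documented behaviour (files ending in .txt, searched recursively, one output folder per folder with text files). It is not the parser's own code, and the function names are hypothetical.

```python
from pathlib import Path


def find_text_files(path):
    """Return the files a run on `path` would be expected to cover."""
    root = Path(path)
    if root.is_file():
        # A single file argument is taken as-is.
        return [root]
    # Folder argument: files ending in .txt in the folder and all subfolders.
    return sorted(p for p in root.rglob("*.txt") if p.is_file())


def expected_output_dirs(path):
    """Return one 'output' folder per folder that contains text files."""
    return sorted({f.parent / "output" for f in find_text_files(path)})
```

For example, expected_output_dirs("examples") lists where to look for the pipe files after running the parser on the examples folder.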

To use level 1 relation types (for more information, see the PDTB 2.0 annotation manual), open config.properties and set SEMANTIC_LEVEL=1 and MODEL_PATH=models/level_1/. Check config.properties for all available options.
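If you prefer to switch levels from a script rather than by hand, a minimal Python sketch is shown below. It assumes config.properties uses plain key=value lines; Java properties files also allow other separators, which this sketch does not handle. The helper name set_properties is hypothetical. Only the two keys and values come from the instructions above.

```python
def set_properties(text, updates):
    """Return properties text with the given keys set; other lines are kept."""
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip()
        is_comment = stripped.startswith(("#", "!"))
        if stripped and not is_comment and "=" in stripped and key in remaining:
            # Overwrite the existing value for this key.
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append any keys that were not present in the file.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


# Settings for level 1 relations, as given in the text above.
LEVEL_1 = {"SEMANTIC_LEVEL": "1", "MODEL_PATH": "models/level_1/"}
```

Read config.properties into a string, pass it through set_properties(text, LEVEL_1), and write the result back. Keep a copy of the original file first.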

Using the parser with the BioDRB Corpus

  1. Download the BioDRB parser from here.
  2. Extract the file with:
    tar -xzvf biodrb-parser.tar.gz
  3. In the extracted biodrb-parser folder, unzip the BioDRB_corpus.zip file.
  4. Check that the corpus paths in config.properties are correct. BIO_DRB_RAW_PATH should point to GeniaRaw/Genia/ and BIO_DRB_ANN_PATH to GeniaAnn/Genia/. The BIO_DRB_TREE_PATH folder will be created by the parser.
  5. From the extracted biodrb-parser folder run:
    java -jar bio-parser.jar [program_arguments]

Program arguments can be one of the following:

  • --train-only: builds a BioDRB model using all 24 articles. The model files are stored in MODEL_PATH.
  • --cross-validation: runs 10-fold cross-validation. The model and test files are stored in MODEL_PATH/CV_K, where K is the fold index.
  • --score-pdtb pdtb_pipe_folder biodrb_pipe_folder: scores the PDTB parser on the BioDRB corpus. pdtb_pipe_folder should contain the pipes generated by the PDTB parser, and biodrb_pipe_folder should contain the BioDRB gold standard pipes (file names should end in .pipe). You can use the pre-generated pipe files with --score-pdtb pdtb_vs_biodrb pdtb_vs_biodrb/bio_drb_gold.
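For readers unfamiliar with the idea behind --cross-validation, here is a minimal Python sketch of a 10-fold split over 24 items. How the parser actually assigns articles to folds is not documented here. The round-robin assignment and the article names below are assumptions made purely for illustration.

```python
def k_fold_splits(items, k=10):
    """Yield (fold_index, train, test); each item is held out exactly once."""
    items = list(items)
    for fold in range(k):
        # Hold out every k-th item, starting at position `fold`.
        test = items[fold::k]
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield fold, train, test


# Placeholder names standing in for the 24 BioDRB articles.
articles = [f"article_{i:02d}" for i in range(24)]
```

With 24 articles and 10 folds, this particular scheme holds out either two or three articles per fold and trains on the rest.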

The mapping from PDTB relation sense types to BioDRB is done according to this table. Precompiled results for 10-fold cross-validation, compared with the PDTB results, can be found here.

Check config.properties for all the options.

Modifying the parser

Using different PDTB sections

To train and/or test the parser on different PDTB sections follow these steps:

  1. Clone with git clone https://github.com/WING-NUS/pdtb-parser.git or download it from here.

  2. Obtain the PTB and PDTB corpus files and move them to external/data/. The external/data/ directory should look like this.

  3. From the project root directory run the following:

    • java -jar runnable_jars/pdtb-tools/span-tree-extractor.jar To generate auxiliary files (external/data/ should now look like this)
    • java -jar runnable_jars/pdtb-tools/train-parser.jar To train the parser
    • java -jar runnable_jars/pdtb-tools/test-parser.jar To test the parser (GS+EP option)
  4. Set the output folder, train and test sections in config.properties.

PDTB Pipe Format

The parser uses the PDTB pipe-delimited format where every relation is represented on a single line and values are delimited by the pipe symbol. There must be 48 columns, but certain values may be blank.

The following lists the column values. For precise definitions of the terms used, please consult the PDTB 2.0 annotation manual.

Note that column indices are zero-based.

  • Col 0: Relation type (Explicit/Implicit/AltLex/EntRel/NoRel)
  • Col 1: Section number (0-24)
  • Col 2: File number (0-99)
  • Col 3: Connective/AltLex SpanList (only for Explicit and AltLex)
  • Col 4: Connective/AltLex GornAddressList (only for Explicit and AltLex)
  • Col 5: Connective/AltLex RawText (only for Explicit and AltLex)
  • Col 6: String position (only for Implicit, EntRel and NoRel)
  • Col 7: Sentence number (only for Implicit, EntRel and NoRel)
  • Col 8: ConnHead (only for Explicit)
  • Col 9: Conn1 (only for Implicit)
  • Col 10: Conn2 (only for Implicit)
  • Col 11: 1st Semantic Class corresponding to ConnHead, Conn1 or AltLex span (only for Explicit, Implicit and AltLex)
  • Col 12: 2nd Semantic Class corresponding to ConnHead, Conn1 or AltLex span (only for Explicit, Implicit and AltLex)
  • Col 13: 1st Semantic Class corresponding to Conn2 (only for Implicit)
  • Col 14: 2nd Semantic Class corresponding to Conn2 (only for Implicit)
  • Col 15: Relation-level attribution: Source (only for Explicit, Implicit and AltLex)
  • Col 16: Relation-level attribution: Type (only for Explicit, Implicit and AltLex)
  • Col 17: Relation-level attribution: Polarity (only for Explicit, Implicit and AltLex)
  • Col 18: Relation-level attribution: Determinacy (only for Explicit, Implicit and AltLex)
  • Col 19: Relation-level attribution: SpanList (only for Explicit, Implicit and AltLex)
  • Col 20: Relation-level attribution: GornAddressList (only for Explicit, Implicit and AltLex)
  • Col 21: Relation-level attribution: RawText (only for Explicit, Implicit and AltLex)
  • Col 22: Arg1 SpanList
  • Col 23: Arg1 GornAddress
  • Col 24: Arg1 RawText
  • Col 25: Arg1 attribution: Source (only for Explicit, Implicit and AltLex)
  • Col 26: Arg1 attribution: Type (only for Explicit, Implicit and AltLex)
  • Col 27: Arg1 attribution: Polarity (only for Explicit, Implicit and AltLex)
  • Col 28: Arg1 attribution: Determinacy (only for Explicit, Implicit and AltLex)
  • Col 29: Arg1 attribution: SpanList (only for Explicit, Implicit and AltLex)
  • Col 30: Arg1 attribution: GornAddressList (only for Explicit, Implicit and AltLex)
  • Col 31: Arg1 attribution: RawText (only for Explicit, Implicit and AltLex)
  • Col 32: Arg2 SpanList
  • Col 33: Arg2 GornAddress
  • Col 34: Arg2 RawText
  • Col 35: Arg2 attribution: Source (only for Explicit, Implicit and AltLex)
  • Col 36: Arg2 attribution: Type (only for Explicit, Implicit and AltLex)
  • Col 37: Arg2 attribution: Polarity (only for Explicit, Implicit and AltLex)
  • Col 38: Arg2 attribution: Determinacy (only for Explicit, Implicit and AltLex)
  • Col 39: Arg2 attribution: SpanList (only for Explicit, Implicit and AltLex)
  • Col 40: Arg2 attribution: GornAddressList (only for Explicit, Implicit and AltLex)
  • Col 41: Arg2 attribution: RawText (only for Explicit, Implicit and AltLex)
  • Col 42: Sup1 SpanList (only for Explicit, Implicit and AltLex)
  • Col 43: Sup1 GornAddress (only for Explicit, Implicit and AltLex)
  • Col 44: Sup1 RawText (only for Explicit, Implicit and AltLex)
  • Col 45: Sup2 SpanList (only for Explicit, Implicit and AltLex)
  • Col 46: Sup2 GornAddress (only for Explicit, Implicit and AltLex)
  • Col 47: Sup2 RawText (only for Explicit, Implicit and AltLex)

Example relation:

Explicit|18|70|262..265|1,0|But|||but|||Comparison.Contrast||||Wr|Comm|Null|Null||||9..258|0|From a helicopter a thousand feet above Oakland after the second-deadliest earthquake in U.S. history, a scene of devastation emerges: a freeway crumbled into a concrete sandwich, hoses pumping water into once-fashionable apartments, abandoned autos|Inh|Null|Null|Null||||266..354|1,1;1,2;1,3|this quake wasn't the big one, the replay of 1906 that has been feared for so many years|Inh|Null|Null|Null|||||||||
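A relation line like the one above can be read with a few lines of Python. The sketch below is not part of the parser. The function and key names are hypothetical, and it assumes that multiple spans in a SpanList are joined by ";", as the GornAddressList in the example is.

```python
PDTB_COLUMNS = 48

# Zero-based indices of a few columns from the list above.
REL_TYPE, SECTION, FILE_NO = 0, 1, 2
CONN_SPANS, CONN_RAW, CONN_HEAD = 3, 5, 8
SENSE_1 = 11
ARG1_SPANS, ARG1_RAW = 22, 24
ARG2_SPANS, ARG2_RAW = 32, 34


def parse_span_list(value):
    """Turn '9..258' into [(9, 258)]; ';' as span separator is an assumption."""
    spans = []
    for part in value.split(";"):
        if part:
            start, end = part.split("..")
            spans.append((int(start), int(end)))
    return spans


def parse_pdtb_line(line):
    """Split one relation line and pick out a few commonly used columns."""
    # str.split keeps trailing empty fields, so blank final columns still count.
    cols = line.rstrip("\r\n").split("|")
    if len(cols) != PDTB_COLUMNS:
        raise ValueError(f"expected {PDTB_COLUMNS} columns, got {len(cols)}")
    return {
        "type": cols[REL_TYPE],
        "section": cols[SECTION],
        "file": cols[FILE_NO],
        "conn_spans": parse_span_list(cols[CONN_SPANS]),
        "conn_raw": cols[CONN_RAW],
        "conn_head": cols[CONN_HEAD],
        "sense": cols[SENSE_1],
        "arg1_spans": parse_span_list(cols[ARG1_SPANS]),
        "arg1_raw": cols[ARG1_RAW],
        "arg2_spans": parse_span_list(cols[ARG2_SPANS]),
        "arg2_raw": cols[ARG2_RAW],
    }
```

Applied to the example relation, parse_pdtb_line returns "Explicit" for type, "but" for conn_head, "Comparison.Contrast" for sense, and [(9, 258)] for arg1_spans.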

BioDRB Pipe Format

There are 27 columns, but only 7 are used:

  • Col 0: Relation type (Explicit, Implicit, AltLex, NoRel)
  • Col 1: (Sets of) Text span offset for connective (when explicit) (e.g. 472..474)
  • Col 7: Connective string “inserted” for Implicit relation
  • Col 8: Sense1 of Explicit Connective (or Implicit Connective)
  • Col 9: Sense2 of Explicit Connective (or Implicit Connective)
  • Col 14: (Sets of) Text span offset for Arg1
  • Col 20: (Sets of) Text span offset for Arg2

More details at:

Prasad R, McRoy S, Frid N, Joshi A and Yu H. 2011. BioDRB: The Biomedical Discourse Relation Bank. BMC Bioinformatics

Example relation:

Explicit|258..260|Wr|Comm|Null|Null|||Purpose.Enablement||||||182..257|Inh|Null|Null|Null||261..298|Inh|Null|Null|Null||
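The seven columns listed above can be pulled out of a BioDRB line with a short Python sketch. It is not part of the parser, and the dictionary keys are hypothetical names chosen for illustration. Span values are left as strings because the separator for sets of spans is not specified here.

```python
BIODRB_COLUMNS = 27

# Zero-based indices of the seven columns described above.
BIODRB_FIELDS = {
    "type": 0,
    "conn_spans": 1,
    "implicit_conn": 7,
    "sense1": 8,
    "sense2": 9,
    "arg1_spans": 14,
    "arg2_spans": 20,
}


def parse_biodrb_line(line):
    """Split one BioDRB relation line and keep only the used columns."""
    # str.split keeps trailing empty fields, so blank final columns still count.
    cols = line.rstrip("\r\n").split("|")
    if len(cols) != BIODRB_COLUMNS:
        raise ValueError(f"expected {BIODRB_COLUMNS} columns, got {len(cols)}")
    return {name: cols[index] for name, index in BIODRB_FIELDS.items()}
```

For the example relation, this gives "Explicit" for type, "258..260" for conn_spans, "Purpose.Enablement" for sense1, "182..257" for arg1_spans and "261..298" for arg2_spans. The implicit_conn and sense2 fields are empty.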

External libraries used

Stanford's CoreNLP Natural Language Processing Toolkit for reading and generating parse trees.

Reference:

  • Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Two old versions of the Charniak parser. Copyright Mark Johnson, Eugene Charniak, 24th November 2005 to August 2006.

References:

  • Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

  • Eugene Charniak. A maximum-entropy-inspired parser. Proceedings of the 1st North American chapter of the Association for Computational linguistics conference. Association for Computational Linguistics, 2000.

Copyright notice and statement of copying permission

Copyright © 2015 WING, NUS and NUS NLP Group.

This program is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If
not, see http://www.gnu.org/licenses/.

Other licensing terms are available; please contact the authors if you require them.