open-sesame

A frame-semantic parsing system based on a softmax-margin SegRNN.


A frame-semantic parser for automatically detecting FrameNet frames and their frame elements in sentences. The model is based on softmax-margin segmental recurrent neural nets, described in our paper, "Frame-Semantic Parsing with Softmax-Margin Segmental RNNs and a Syntactic Scaffold". An example of a frame-semantic parse is shown below.

[Figure: Frame-semantics example]

Installation

This project was developed using Python 2.7. It also requires the DyNet library and some NLTK packages.

pip install dynet
pip install nltk
python -m nltk.downloader averaged_perceptron_tagger wordnet

Data Preprocessing

This codebase only handles data in the XML format specified under FrameNet. However, we first reformat the data for ease of readability.

  1. First, create a data/ directory here, download FrameNet version 1.x, and place it under data/fndata-1.x/. Also create a directory data/neural/fn1.x/ to hold the files converted to the CoNLL 2009-style format.

  2. Convert the data into a format similar to CoNLL 2009, but with BIO tags, by executing:

cd src/
python preprocess.py 2> err

The above script writes the train, dev and test files in the required format into the data/neural/fn1.x/ directory. The annotations are noisy. Annotations that could not be used are written to standard error along with the corresponding error messages; the command above redirects them to the file err.
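To make the idea of BIO tags concrete, here is a minimal sketch of turning labeled token spans into one tag per token. It is not taken from preprocess.py: the function name spans_to_bio, the inclusive (start, end, label) span representation, the choice to reject overlapping spans, and the example sentence with its Buyer/Goods/Seller labels are all assumptions made for illustration.

```python
def spans_to_bio(num_tokens, spans):
    """Convert labeled token spans into one BIO tag per token.

    spans is a list of (start, end, label) tuples with inclusive token
    indices. This representation is an assumption made for this sketch.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not (0 <= start <= end < num_tokens):
            raise ValueError("span out of range: %r" % ((start, end, label),))
        if any(tag != "O" for tag in tags[start:end + 1]):
            raise ValueError("overlapping span: %r" % ((start, end, label),))
        tags[start] = "B-" + label          # first token of the span
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label          # remaining tokens of the span
    return tags


# Hypothetical example sentence and frame-element spans.
tokens = ["She", "bought", "a", "car", "from", "him"]
spans = [(0, 0, "Buyer"), (2, 3, "Goods"), (4, 5, "Seller")]
bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print("%s\t%s" % (token, tag))
```

Tokens outside every span keep the tag O, so the frame-evoking word "bought" is left untagged in this sketch.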

  3. [Optional, but highly recommended] If you want to use pretrained GloVe word embeddings, download and extract them under data/. Run the preprocessing with an extra argument for the intended GloVe file:

python preprocess.py glove.6B.100d.txt 2> err

This trims the GloVe file to the FrameNet vocabulary, which reduces memory requirements. For example, the above creates data/glove.6B.100d.framevocab.txt to be used by our models.
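The trimming step can be pictured roughly as follows. This is a sketch rather than the code in preprocess.py: it assumes the GloVe text format of one word per line followed by space-separated vector values, and the helper names trim_embeddings and trim_embedding_file are hypothetical.

```python
import io


def trim_embeddings(lines, vocab):
    """Yield only the embedding lines whose word (first field) is in vocab."""
    for line in lines:
        word = line.split(" ", 1)[0]
        if word in vocab:
            yield line


def trim_embedding_file(src_path, dst_path, vocab):
    """Stream src_path to dst_path, keeping in-vocabulary lines only.

    Returns the number of lines kept. Reading line by line avoids loading
    the full embedding file into memory.
    """
    kept = 0
    with io.open(src_path, encoding="utf-8") as src:
        with io.open(dst_path, "w", encoding="utf-8") as dst:
            for line in trim_embeddings(src, vocab):
                dst.write(line)
                kept += 1
    return kept


# Tiny in-memory example with made-up two-dimensional vectors.
sample = ["the 0.1 0.2\n", "zebra 0.3 0.4\n", "buy 0.5 0.6\n"]
vocab = {"the", "buy"}
trimmed = list(trim_embeddings(sample, vocab))
print("".join(trimmed))
```

A call such as trim_embedding_file("data/glove.6B.100d.txt", "data/glove.6B.100d.framevocab.txt", vocab) would mirror the file names mentioned above, given a vocabulary set collected from the FrameNet data.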

Frame Identification

Frame identification is based on a bidirectional LSTM model.

Training

To train the frame identification module, execute:

cd src/
python frameid.py

This saves the model that performs best on the validation data in the directory src/tmp/; the symbolic link src/model.frameid.1.x will point to it. Pre-trained models are coming soon.

Test

To test under the best model in src/model.frameid.1.x, execute:

python frameid.py --mode test > frameid.log

frameid.log will contain an example-wise analysis. The output will be written in CoNLL 2009 format to predicted.1.x.frameid.test.out, and in the frame-elements file format to my.predict.test.frame.elements.

Frame-Element (Argument) Identification

Argument identification is based on a segmental recurrent neural net model, used as a baseline in our paper.
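For readers unfamiliar with segmental models, the sketch below shows the kind of dynamic program such models typically rely on: it finds the best-scoring way to split a sentence into labeled segments. It is not the implementation in segrnn-argid.py. In the actual model the segment scores come from a neural network and training uses a softmax-margin objective; here the scores come from a hand-written toy function, and the names segmental_viterbi and toy_score, the labels, and the gold segments are all assumptions for illustration.

```python
def segmental_viterbi(n, labels, max_len, score):
    """Find the best segmentation of tokens 0..n-1 into labeled segments.

    score(start, end, label) returns a float for the segment covering
    tokens start..end-1. Segments are at most max_len tokens long.
    Returns (best_total_score, [(start, end, label), ...]).
    """
    best = [float("-inf")] * (n + 1)   # best[i]: best score covering tokens 0..i-1
    back = [None] * (n + 1)            # back[i]: (start, label) of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                cand = best[start] + score(start, end, label)
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, label)
    # Follow back-pointers from the end of the sentence to recover segments.
    segments = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return best[n], segments


# Toy scores for the 4-token sentence "She bought a car" (hypothetical).
gold = {(0, 1): "Buyer", (2, 4): "Goods"}


def toy_score(start, end, label):
    if gold.get((start, end)) == label:
        return 2.0                     # reward the intended segments
    if label == "O" and end - start == 1:
        return 0.0                     # single-token "no argument" segment
    return -1.0                        # penalize everything else


total, segments = segmental_viterbi(4, ["Buyer", "Goods", "O"], 3, toy_score)
print(total, segments)
```

Scoring whole segments rather than individual tokens lets a model assign a label to a multi-word argument such as "a car" in a single decision, at the cost of considering every (start, end, label) combination up to max_len.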

Training

To train an argument identifier, execute:

cd src/
python segrnn-argid.py 2> err

This saves the model that performs best on the validation data in the directory src/tmp/; the symbolic link src/model.segrnn-argid.1.x will point to it. Pre-trained models are coming soon.

Test

To test under the best model in src/model.segrnn-argid.1.x, execute:

python segrnn-argid.py --mode test > argid.log

Contact and Reference

For questions and usage issues, please contact swabha@cs.cmu.edu. If you use open-sesame for research, please cite our paper as follows:

@article{swayamdipta:17,
  title={{Frame-Semantic Parsing with Softmax-Margin Segmental RNNs and a Syntactic Scaffold}},
  author={Swabha Swayamdipta and Sam Thomson and Chris Dyer and Noah A. Smith},
  journal={arXiv preprint arXiv:1706.09528},
  year={2017}
}