A typical text consists of sentences that are glued together in a systematic way to form a coherent discourse. Shallow discourse parsing is the task of parsing a piece of text into a set of discourse relations between two adjacent or non-adjacent discourse units. We call this task shallow discourse parsing because the relations in a text are not connected to one another to form a connected structure in the form of a tree or graph.

In this project, I extracted three kinds of features: word pairs, production rules, and dependency rules. I also added first-last pairs to increase accuracy. After extracting the features, I used mutual information to reduce the dimensionality and trained the model with the maxent classifier in NLTK. The final model reached an accuracy of 40.7 on the test data set. More results can be found in the project report (in Chinese).
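The word-pair and first-last features and the mutual-information filter can be illustrated with a small standard-library sketch. It is not the code in preprocess.py: the function names, the feature-string format, and the exact definition of a "first-last pair" are assumptions made for illustration, and the labelled examples are toy data.

```python
import math
from collections import Counter
from itertools import product


def word_pair_features(arg1_tokens, arg2_tokens):
    """Cross product of the tokens of the two arguments (hypothetical format)."""
    return {f"wp={a}|{b}" for a, b in product(arg1_tokens, arg2_tokens)}


def first_last_features(arg1_tokens, arg2_tokens):
    """First and last token of each argument plus their pairings (assumed definition)."""
    f1, l1 = arg1_tokens[0], arg1_tokens[-1]
    f2, l2 = arg2_tokens[0], arg2_tokens[-1]
    return {
        f"first1={f1}", f"last1={l1}",
        f"first2={f2}", f"last2={l2}",
        f"first1_first2={f1}|{f2}", f"last1_last2={l1}|{l2}",
    }


def mutual_information(examples):
    """Score each binary feature by its mutual information (in nats) with the label.

    examples: list of (feature_set, label) pairs.
    """
    n = len(examples)
    label_counts = Counter(label for _, label in examples)
    feat_counts = Counter()
    joint = Counter()
    for feats, label in examples:
        for f in feats:
            feat_counts[f] += 1
            joint[(f, label)] += 1
    scores = {}
    for f, n_f in feat_counts.items():
        mi = 0.0
        for label, n_l in label_counts.items():
            n_fl = joint[(f, label)]
            # Two cells per label: feature present, feature absent.
            for n_xy, n_x in ((n_fl, n_f), (n_l - n_fl, n - n_f)):
                if n_xy > 0:
                    mi += (n_xy / n) * math.log(n_xy * n / (n_x * n_l))
        scores[f] = mi
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring features (ties broken alphabetically)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Toy data: (Arg1 tokens, Arg2 tokens, sense label).
toy = [
    (["it", "rained"], ["we", "stayed", "home"], "Contingency"),
    (["it", "rained"], ["we", "went", "out"], "Comparison"),
]
examples = [
    (word_pair_features(a1, a2) | first_last_features(a1, a2), label)
    for a1, a2, label in toy
]
kept = select_top_k(mutual_information(examples), 5)
print(kept)
```

Feature sets filtered this way could then be converted to dictionaries such as {feature: True} and passed, together with their labels, to nltk.classify.MaxentClassifier.train.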

File description

data/: some files for training

lib/: open-source tool libraries used for feature extraction

model/: saved models

test/: directory for files generated during testing

cleandata.py: some functions for data cleaning

config.py: constants in programs

mytest.py: test program

mytrain.py: training program

preprocess.py: some functions for generating training data

scorer.py: standard scorer program

predict.json: the default test output


Requirements

Java >= 1.8.0 (as reported by java -version)

NLTK 3.0.0

sklearn


For training:

python mytrain.py

The default output is 'train.model'

For testing:

usage: mytest.py [-h] [rule] file

Test the model with the following options.

rule (optional):

all: generate dependency rules and production rules

drule: generate dependency rules

prule: generate production rules

none: use the already generated rules file

file: test data file (required)


python mytest.py all test_pdtb.nosense.json	

The default output is 'predict.json'
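For readers who want to see how a command line of this shape can be declared, here is a minimal argparse sketch that reproduces the usage string above. It is not the code in mytest.py: the build_parser name is hypothetical, and the default value for rule is an assumption, since the usage text does not say what happens when rule is omitted.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring 'mytest.py [-h] [rule] file'."""
    parser = argparse.ArgumentParser(
        prog="mytest.py",
        description="test model with options",
    )
    parser.add_argument(
        "rule",
        nargs="?",  # optional positional, shown as [rule] in the usage line
        choices=["all", "drule", "prule", "none"],
        default="all",  # assumption: the real default is not documented
        help="all: dependency and production rules; drule: dependency rules; "
             "prule: production rules; none: use generated rules file",
    )
    parser.add_argument("file", help="test data file (required)")
    return parser


# Same arguments as the example command above.
args = build_parser().parse_args(["all", "test_pdtb.nosense.json"])
print(args.rule, args.file)
```

Because rule is declared with nargs="?", argparse assigns a single positional argument to file and falls back to the default for rule.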

