Rule-based Feature Extraction and Applying Machine Learning Models for Conference Resolution from the WikiCoref corpus
Author: Hung Phan and Ziwei Zhou
In this project, we design an approach for building a training data and testing data from a wellknown corpus WikiCoref, then apply these data on machine learning models for mention co-refered prediction. Our machhine learning will give a document in CONLL form as input, and the output is the set of mentions and the result of co-reference result (1 for co-refer and 0 otherwise for each pair of mentions in the document. We conduct the data which have a set of mentions in 30 documents of WikiCoref, including with feature vector for each pair of mentions and the label for the co-reference result. The corpus is over 1.6 millions pairs of mentions in WikiCoref. We apply several machine learning approaches and do the cross-validation for each mentions and have the accuracy as follow;
- GaussianNB (55.67% in total accuracy)
- LogisticRegression (92.16%)
- DecisionTreeClassifier (92.47%)
- RandomForestClassifier (92.65%)
- AdaBoostClassifier (to be updated)
- LinearDiscriminantAnalysis (57.29%)
- QuadraticDiscriminantAnalysis()
- LinearSVC
- NuSVC
- MLPClassifier
- GradientBoostingClassifier
The process of implementing this project is done by 3 steps:
Step 1: Generate Conll format from Ontonote file of WikiCoref:
- Since WikiCoref doesn't have manualled labeling information like Conll dataset, we use Stanford NLP Parser to get parsed tree information and semantic labeling information for each word in WikiCoref.
Step 2: Extracting feature from WikiCoref in Conll format:
- We extract the feature for each mentions and other features relate to pairs of mentions (such as mention distance, mention spans) to get a 70-dimention of vector for each pair of mentions.
- We use Spacy library and improve the code from NeuralCoref (https://github.com/huggingface/neuralcoref) to handling feature for WikiCoref.
Step 3: Writing code for applying Machine Learning models predicting mention co-referred.
- We write code in python that take all feature and label data in csv format and get the total accuracy by cross validation.
- If the evaluation is time consumming, you can use our compact version of training-testing data.
In overal, in this project we build Machine Learning models that given 2 arbitrary mentions and produce the output that predict 2 mentions are co-referred or not. We think that the 2 remaining challenges of this corpus is automatically generated label and the corpus contains many pairs with 0 label so it might un-balance.The details of each steps can be seen in the documentation folder.