Text-chunking-based-on-GP

This is a lab for practicing AutoGP and BGPLVM.



Note: please install the Chrome extension "TeX All the Things" so the LaTeX in this page renders correctly.

Background

In this academic project, we made use of the dataset from the CoNLL-2000 shared task on text chunking (Tjong Kim Sang and Buchholz, 2000). Text chunking is concerned with dividing the text into syntactically-related chunks of words, or phrases. These phrases are non-overlapping in the sense that a word can only be a member of one phrase. For example, consider the sentence:

He reckons the current account deficit will narrow to only #1.8 billion in September.

The segmentation of this sentence into chunks and their corresponding labels is shown in Table 1. The chunk label contains the type of the chunk, e.g. I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two kinds of labels to delineate the boundaries of the chunk: B-CHUNK for the first word of the chunk and I-CHUNK for every other word in the chunk. While all the necessary information to carry out this assignment is contained within this specification, you may also find out more about this task here.


Table 1.

He B-NP
reckons B-VP
the B-NP
current I-NP
account I-NP
deficit I-NP
will B-VP
narrow I-VP
to B-PP
only B-NP
# I-NP
1.8 I-NP
billion I-NP
in B-PP
September B-NP
. O

Data

Instead of providing the raw text data, we have preprocessed the dataset and extracted features from it. These are given in the compressed files "conll_train.zip" and "conll_test_features.zip". When extracted, you will find files "i.x" and "i.y" containing the features and chunk labels for the ith sentence, respectively.

Schema

Let $T_i$ be the length of the ith sentence, i.e., the number of words/tokens it contains. There is a $D$-dimensional binary feature vector for each word/token in the sentence, where $D = 2{,}035{,}523$. Due to the high dimensionality of the feature space, the “i.x” file provides a sparse representation of the feature vectors for the ith sentence. A row entry with the value

j k 1

indicates that the kth feature for the jth word/token in the sentence has value 1. Next, the “i.y” file contains the label $c \in \{1, \dots, 23\}$ of each of the $T_i$ words/tokens in the sentence.
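
To make the file layout concrete, here is a minimal sketch of how one sentence could be loaded into a SciPy sparse matrix. The helper name, the directory name, and the assumption that the j/k indices in "i.x" are 1-based are ours for illustration; only the "j k 1" row format, the value of $D$, and the label range come from the description above.

    import numpy as np
    from scipy.sparse import lil_matrix

    D = 2035523  # feature dimensionality given above

    def load_sentence(i, data_dir="conll_train"):
        """Load the sparse features and chunk labels of the ith sentence.

        Each row of "i.x" reads "j k 1": feature k of token j has value 1.
        "i.y" holds one label in {1, ..., 23} per token.
        """
        y = np.loadtxt("%s/%d.y" % (data_dir, i), dtype=int, ndmin=1)
        T_i = len(y)                                 # number of tokens
        X = lil_matrix((T_i, D), dtype=np.float32)
        rows = np.loadtxt("%s/%d.x" % (data_dir, i), dtype=int, ndmin=2)
        for j, k, v in rows:
            X[j - 1, k - 1] = v                      # assuming 1-based indices in the files
        return X.tocsr(), y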

Environment

  • Dependencies

    We require the following packages, which can be installed directly with pip:

    • GPflow >= 1.3.0
    • numpy >= 1.11.1
    • scikit-learn >= 0.17.1
    • scipy >= 0.17.0
    • tensorflow >= 1.11
  • Installation

    • on MAC or linux
    $ pip3 install --user -U pip
    $ pip3 install -r requirement.txt
    • on Windows
    > python -m pip install --user -U pip
    > pip install -r requirement.txt

Preprocessing

The input data can be converted to a very high-dimensional sparse matrix, so we represent the input words/tokens with SciPy's sparse matrix classes and reduce the dimensionality to 200 with TruncatedSVD from scikit-learn.
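
A rough sketch of this step, assuming the load_sentence helper from the Schema section and a num_sentences variable for the number of extracted training sentences (neither name is fixed by this README):

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.decomposition import TruncatedSVD

    num_sentences = 8000        # set to the number of extracted training sentences
    data = [load_sentence(i) for i in range(1, num_sentences + 1)]

    # Stack the per-sentence sparse matrices into one (num_tokens x D) matrix,
    # then project the 2,035,523 binary features down to 200 dimensions.
    X_all = vstack([x for x, _ in data])
    y_all = np.concatenate([y for _, y in data])

    svd = TruncatedSVD(n_components=200, random_state=0)
    X_reduced = svd.fit_transform(X_all)   # dense array, shape (num_tokens, 200)

The same fitted svd should then be applied with svd.transform to the test features so that train and test tokens live in the same 200-dimensional space.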

Model

Sparse Variational Gaussian approximation

We pick inducing points with the k-means method to reduce the computational cost; this brings the model accuracy up to around 90%, which is clearly better than the softmax classifier baseline (82%).
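
Below is a minimal sketch of this setup against the GPflow 1.x API listed under Environment. The number of inducing points, the optimizer settings, and the names X_reduced, y_all and X_test_reduced (from the preprocessing sketch above) are illustrative assumptions, not values taken from this repository.

    import gpflow
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    num_classes = 23
    num_inducing = 500                     # assumed; not specified in this README

    # Choose inducing point locations by clustering the 200-d SVD features.
    Z = MiniBatchKMeans(n_clusters=num_inducing,
                        random_state=0).fit(X_reduced).cluster_centers_

    # Shift labels from {1, ..., 23} to {0, ..., 22} for the multi-class likelihood.
    Y = (y_all - 1).reshape(-1, 1).astype(np.float64)

    model = gpflow.models.SVGP(X_reduced, Y,
                               kern=gpflow.kernels.RBF(input_dim=200),
                               likelihood=gpflow.likelihoods.MultiClass(num_classes),
                               Z=Z,
                               num_latent=num_classes,
                               minibatch_size=256)

    gpflow.train.AdamOptimizer(0.01).minimize(model, maxiter=10000)

    # Predicted chunk label = argmax of the predicted class probabilities.
    probs, _ = model.predict_y(X_test_reduced)
    pred = probs.argmax(axis=1) + 1        # back to the original 1-based labels

pred can then be scored against held-out labels to compare with the softmax baseline mentioned above.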

More Information on SVGP

Scalable Automated Variational Inference for Gaussian Process

I ported SAVIGP (the Sparse-GP implementation) to a Python 3 environment and fixed the ARD bug in kernel.py. I also enabled PyTorch CUDA acceleration for the SAVIGP model on my GitHub.

More Information on SAVIGP

TODO:

  • Modify the source code to fit our data, see how it performs, and reduce the memory consumption, possibly by using TensorFlow or PyTorch.

AutoGP

More Information on AutoGP

TODO:

  • Port the code to Python 3.

Reference

  • [1] Tjong Kim Sang, E. F. and Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pages 127–132. Association for Computational Linguistics.