Note: to render the inline math below in Chrome, please install the "TeX All the Things" extension.
In this academic project, we made use of the dataset from the CoNLL-2000 shared task on text chunking (Tjong Kim Sang and Buchholz, 2000). Text chunking is concerned with dividing the text into syntactically-related chunks of words, or phrases. These phrases are non-overlapping in the sense that a word can only be a member of one phrase. For example, consider the sentence:
He reckons the current account deficit will narrow to only #1.8 billion in September.
The segmentation of this sentence into chunks and their corresponding labels is shown in Table 1. The chunk label contains the type of the chunk, e.g. I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two kinds of labels to delineate the boundaries of the chunk: B-CHUNK for the first word of the chunk and I-CHUNK for every other word in the chunk. While all the necessary information to carry out this assignment is contained within this assignment specification, you may also find out more about this task here.
Table 1.

| Word | Chunk label |
| --- | --- |
| He | B-NP |
| reckons | B-VP |
| the | B-NP |
| current | I-NP |
| account | I-NP |
| deficit | I-NP |
| will | B-VP |
| narrow | I-VP |
| to | B-PP |
| only | B-NP |
| # | I-NP |
| 1.8 | I-NP |
| billion | I-NP |
| in | B-PP |
| September | B-NP |
| . | O |
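The B-/I-/O labeling scheme above can be decoded back into phrase spans with a short helper. This is a minimal sketch for illustration, not part of the provided code:

```python
def bio_to_chunks(words, tags):
    """Group (word, tag) pairs into (chunk_type, [words]) spans.

    A B-X tag starts a new chunk of type X, an I-X tag continues the
    current chunk of the same type, and O marks a word outside any chunk.
    """
    chunks = []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            chunks.append((tag[2:], [word]))
        elif tag.startswith("I-") and chunks and chunks[-1][0] == tag[2:]:
            chunks[-1][1].append(word)
        elif tag != "O":
            # An I- tag with no open chunk of that type starts a new chunk
            chunks.append((tag[2:], [word]))
    return chunks

words = ["He", "reckons", "the", "current", "account", "deficit"]
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]
print(bio_to_chunks(words, tags))
# [('NP', ['He']), ('VP', ['reckons']), ('NP', ['the', 'current', 'account', 'deficit'])]
```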
Instead of providing raw text data, we have preprocessed and extracted features from this dataset. These are given in the compressed files "conll_train.zip" and "conll_test_features.zip". When extracted, you will find files "i.x" and "i.y" containing the features and chunk labels for the ith sentence, respectively.
Let $x_{jk} = 1$ indicate that the $k$th feature for the $j$th word/token in the sentence has value 1; the "i.x" file stores these features. The "i.y" file contains the chunk label for each word in the $i$th sentence.
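Assuming the features are stored as lists of active (value-1) indices per word, they can be assembled into a SciPy sparse matrix. The exact on-disk layout of the "i.x" files may differ; this is only a sketch of building such indicator features:

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_feature_matrix(active_features, n_features):
    """Build a binary CSR matrix from per-word lists of active feature indices.

    active_features[j] holds the indices k for which x_{jk} = 1.
    """
    rows, cols = [], []
    for j, feats in enumerate(active_features):
        rows.extend([j] * len(feats))
        cols.extend(feats)
    data = np.ones(len(rows), dtype=np.float32)
    return csr_matrix((data, (rows, cols)),
                      shape=(len(active_features), n_features))

# Three words, five binary features
X = build_feature_matrix([[0, 3], [1], [2, 3]], n_features=5)
print(X.toarray())
```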
Dependencies
We require the following packages that can be directly installed from pip:
- GPflow >= 1.3.0
- numpy >= 1.11.1
- scikit-learn >= 0.17.1
- scipy >= 0.17.0
- tensorflow >= 1.11
Installation
- on MAC or linux
```
$ pip3 install --user -U pip
$ pip3 install -r requirement.txt
```
- on Windows
```
> python -m pip install --user -U pip
> pip install -r requirement.txt
```
The input data converts to a very high-dimensional sparse matrix, so we represent the input words/tokens with SciPy's sparse matrix classes and reduce the dimensionality to 200 with TruncatedSVD from scikit-learn.
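The reduction step can be sketched as follows; the random matrix below is only a stand-in for the actual token features:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for the high-dimensional sparse token features
rng = np.random.RandomState(0)
X = sparse_random(500, 10000, density=0.001, format="csr", random_state=rng)

# Unlike PCA, TruncatedSVD accepts scipy sparse input directly,
# so the matrix never has to be densified
svd = TruncatedSVD(n_components=200, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (500, 200)
```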
We select inducing points with the k-means method to reduce the consumption of computational resources; this improves model accuracy to around 90%, which clearly outperforms the softmax classifier baseline (82%).
More Information on SVGP
I ported SAVIGP from the Sparse-GP repository to a Python 3 environment and fixed the ARD bug in kernel.py. I also enabled PyTorch CUDA acceleration for the SAVIGP model on my GitHub.
More Information on SAVIGP
TODO:
- Modify the source code to fit the data, evaluate how it performs, and reduce memory consumption, possibly by using TensorFlow or PyTorch.
More Information on AutoGP
TODO:
- Port the code to Python 3.
References

[1] Tjong Kim Sang, E. F. and Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, Volume 7, pages 127–132. Association for Computational Linguistics.