A GitHub repository for this project is available online.
The goal of this project was to implement and train a part-of-speech (POS) tagger, as described in "Speech and Language Processing" (Jurafsky and Martin).
A hidden Markov model is implemented to estimate the transition and emission probabilities from the training data. The Viterbi algorithm is used for decoding, i.e. finding the most likely sequence of hidden states (POS tags) for previously unseen observations (sentences).
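For illustration, the core of a Viterbi decoder over log probabilities can be sketched as follows. The data structures and names below are illustrative only and do not mirror the actual implementation in `scripts/viterbi.py`:

```python
def viterbi(observations, states, log_trans, log_emit, log_init):
    """Minimal Viterbi sketch over log probabilities (illustrative, not the
    project's implementation).

    log_trans[p][s] : log P(s | p), log_emit[s][o] : log P(o | s),
    log_init[s]     : log P(s) for the first observation.
    """
    # best[t][s]: log prob of the best path ending in state s at time t;
    # back[t][s]: the predecessor state on that path.
    best = [{s: log_init[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t, obs in enumerate(observations[1:], start=1):
        best.append({})
        back.append({})
        for s in states:
            # choose the predecessor that maximizes the path probability
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = best[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs]
            back[t][s] = prev
    # trace back from the best final state
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```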
The HMM is trained on bigram distributions (distributions of pairs of adjacent tokens). The first pass over the training data generates a fixed list of vocabulary tokens. Any token occurring only once in the training data is replaced by a special unknown-word token chosen according to a few selected morphological idiosyncrasies of common English word classes (e.g. most tokens with the suffix "-ism" are nouns). The second pass uses the transformed training data to collect the bigram transition and emission counts and saves them to a model file.
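A minimal sketch of this kind of suffix-based mapping; the exact suffix lists and class names here are hypothetical and chosen only to illustrate the idea:

```python
from collections import Counter

def build_vocab(tokens):
    """Keep only tokens seen at least twice; everything else is unknown."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= 2}

def assign_unknown(word):
    """Map an out-of-vocabulary word to a coarse unknown-word class.

    The suffix lists below are illustrative, not the project's actual rules.
    """
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if word.endswith(("ism", "tion", "ness", "ment")):      # noun-like
        return "--unk_noun--"
    if word.endswith(("ize", "ate", "ify", "ed", "ing")):   # verb-like
        return "--unk_verb--"
    if word.endswith(("ous", "able", "ible", "ful", "ive")):  # adjective-like
        return "--unk_adj--"
    if word.endswith("ly"):                                  # adverb-like
        return "--unk_adv--"
    return "--unk--"
```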
To decode the development and test splits, the input sequence is first transformed according to the unknown-word rules mentioned above. The transition and emission counts are then converted to proper probability distributions, using additive smoothing to estimate probabilities for transitions/emissions that have not been observed in the training data. A pseudo-count `alpha > 0` is used as the smoothing parameter, with `alpha = 0.001` giving the best results on the development split (see results below).
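A sketch of the smoothed transition estimate, assuming the counts are kept in plain dictionaries (the function name and data layout are illustrative):

```python
def transition_prob(trans_counts, tag_counts, prev_tag, tag, num_tags, alpha=0.001):
    """Additive smoothing of a bigram transition probability:

        P(tag | prev_tag) = (C(prev_tag, tag) + alpha) / (C(prev_tag) + alpha * N)

    where N is the number of distinct tags. Unseen transitions thus receive a
    small non-zero probability instead of zero. Emission probabilities are
    smoothed the same way, with the vocabulary size in place of N.
    """
    return (trans_counts.get((prev_tag, tag), 0) + alpha) / (
        tag_counts[prev_tag] + alpha * num_tags
    )
```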
For training and decoding, the input sequences are treated as one continuous sequence of tokens. Sentence boundaries are marked by introducing an artificial "start-of-sentence" state (`--s--`) occurring with "newline" tokens (`--n--`). It takes about 60 seconds to train the model and decode the development split.
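For illustration, reading the tagged corpus along those lines might look like this. The file layout assumed here, one word/tag pair per line with blank lines between sentences, is an assumption, not a documented format:

```python
START_STATE = "--s--"     # artificial start-of-sentence state
NEWLINE_TOKEN = "--n--"   # artificial token standing in for the newline

def read_tagged_corpus(path):
    """Yield (token, tag) pairs, mapping blank lines to sentence boundaries."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                # sentence boundary: newline token paired with the start state
                yield NEWLINE_TOKEN, START_STATE
            else:
                word, tag = line.split()
                yield word, tag
```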
The HMM is implemented in `scripts/hmm.py`. The trained model with transition, emission, and state counts is stored in `data/hmm_model.txt`. A sorted list of vocabulary tokens is stored in `data/hmm_vocab.txt`.

The Viterbi algorithm is implemented in `scripts/viterbi.py`. Output files containing the predicted POS tags are written to the `output/` directory. All settings can be adjusted by editing the paths specified in `scripts/settings.py`.
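The README does not show the contents of that module, but a settings file of this kind typically just collects the paths mentioned in this document; the variable names below are hypothetical:

```python
# scripts/settings.py -- hypothetical layout collecting the paths used above
TRAINING_FILE = "WSJ/WSJ_02-21.pos"    # training split (WSJ sections 02-21)
DEV_WORDS = "WSJ/WSJ_24.words"         # development input
DEV_POS = "WSJ/WSJ_24.pos"             # development gold tags
TEST_WORDS = "WSJ/WSJ_23.words"        # test input (gold tags not released)
MODEL_FILE = "data/hmm_model.txt"      # transition/emission/state counts
VOCAB_FILE = "data/hmm_vocab.txt"      # sorted vocabulary
OUTPUT_DIR = "output/"                 # predicted .pos files
CONFUSION_MATRIX = "docs/confusion_matrix.csv"
ALPHA = 0.001                          # additive smoothing parameter
```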
To (re-)run the tagger on the development and test sets, run:

```
[viterbi-pos-tagger]$ python3.6 scripts/hmm.py dev
[viterbi-pos-tagger]$ python3.6 scripts/hmm.py test
```
Training and decoding take about a minute in total; you should see output similar to the following:
```
[viterbi-pos-tagger]$ python3.6 scripts/hmm.py dev
Generating vocabulary...
Training model...
Decoding dev split...
Words processed: 5000
Words processed: 10000
Words processed: 15000
Words processed: 20000
Words processed: 25000
Words processed: 30000
Done
python scripts/hmm.py dev 64.14s user 0.75s system 95% cpu 1:07.72 total
```
Please note that unless you run `rm -rf data/hmm*` to delete the old model files, they will not be regenerated during the next run.
The evaluation script is implemented in `scripts/eval.py`. It prints a text report showing the main classification metrics per tag, as well as the overall accuracy score. It also writes a confusion matrix to `docs/confusion_matrix.csv`.
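Assuming scikit-learn is used for the metrics (the report format shown later in this README matches its `classification_report`), the core of such a script can be sketched as follows; the actual `scripts/eval.py` may be organized differently:

```python
import csv
import sys
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def read_tags(path):
    """Read the tag column of a .pos file (one word/tag pair per line)."""
    with open(path) as f:
        return [line.split()[1] for line in f if line.strip()]

true_tags = read_tags(sys.argv[1])
pred_tags = read_tags(sys.argv[2])

print(classification_report(true_tags, pred_tags))
print("accuracy:", accuracy_score(true_tags, pred_tags))

# write the confusion matrix with row/column labels
labels = sorted(set(true_tags) | set(pred_tags))
matrix = confusion_matrix(true_tags, pred_tags, labels=labels)
with open("docs/confusion_matrix.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([""] + labels)
    for label, row in zip(labels, matrix):
        writer.writerow([label] + list(row))
```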
First create a virtual environment and `pip install` all the requirements:
```
[viterbi-pos-tagger]$ virtualenv -p python3.6 env/
[viterbi-pos-tagger]$ source env/bin/activate
[viterbi-pos-tagger]$ pip install -r requirements.txt
```
Then run the evaluation script as follows:
```
[viterbi-pos-tagger]$ python scripts/eval.py <TRUE .pos> <PREDICTED .pos>
```
To evaluate the results on the development and test set, run:
```
[viterbi-pos-tagger]$ python scripts/eval.py WSJ/WSJ_24.pos output/wsj_24.pos # dev
[viterbi-pos-tagger]$ #python scripts/eval.py WSJ/WSJ_23.pos output/wsj_23.pos # test
```
As usual, section 24 of the WSJ corpus is used as the development set. The tagged output file for the development set is `output/wsj_24.pos`. The original corpus files are `WSJ/WSJ_24.words` and `WSJ/WSJ_24.pos`.
Initially, Viterbi decoding with a uniform probability for unknown words and add-one smoothing gave a tagging accuracy of 92.88% on the development set. Adding morphological features to improve the handling of unknown words increased accuracy to 93.34%. Finally, tuning the additive smoothing parameter raised the development-set accuracy to 95.31%.
For more details, please see `docs/accuracy.md`.
| alpha | accuracy score |
| --- | --- |
| 1.0 | 0.9334307369190028 |
| 0.5 | 0.9419839892856056 |
| 0.2 | 0.9474020637384714 |
| 0.1 | 0.9498980306212522 |
| 0.001 | 0.953063647155511 |
Below is the classification report for the tagging accuracy on the development set.
```
             precision    recall  f1-score   support

          #       1.00      1.00      1.00         3
          $       1.00      1.00      1.00       216
         ''       1.00      1.00      1.00       247
          (       1.00      1.00      1.00        54
          )       1.00      1.00      1.00        53
          ,       1.00      1.00      1.00      1671
          .       1.00      1.00      1.00      1337
          :       1.00      1.00      1.00       221
         CC       1.00      0.99      1.00       877
         CD       0.98      0.98      0.98      1054
         DT       0.99      0.99      0.99      2856
         EX       0.97      1.00      0.99        37
         FW       0.29      0.50      0.36         8
         IN       0.99      0.95      0.97      3612
         JJ       0.86      0.94      0.90      2036
        JJR       0.86      0.87      0.87        93
        JJS       0.96      0.94      0.95        53
         LS       1.00      0.60      0.75         5
         MD       1.00      0.98      0.99       339
         NN       0.96      0.94      0.95      4541
        NNP       0.94      0.97      0.95      3216
       NNPS       0.77      0.51      0.62       127
        NNS       0.93      0.96      0.94      2050
        PDT       0.88      0.95      0.91        22
        POS       0.99      0.99      0.99       299
        PRP       0.99      0.99      0.99       538
       PRP$       0.99      1.00      0.99       271
         RB       0.87      0.91      0.89      1044
        RBR       0.73      0.76      0.75        54
        RBS       0.95      0.95      0.95        20
         RP       0.54      0.89      0.67        87
        SYM       1.00      0.80      0.89        10
         TO       1.00      1.00      1.00       805
         UH       0.33      0.25      0.29         4
         VB       0.95      0.93      0.94      1010
        VBD       0.94      0.90      0.92      1020
        VBG       0.93      0.82      0.87       528
        VBN       0.85      0.82      0.83       758
        VBP       0.92      0.89      0.90       422
        VBZ       0.94      0.95      0.94       701
        WDT       0.91      0.95      0.93       123
         WP       0.97      0.99      0.98        90
        WP$       1.00      1.00      1.00         7
        WRB       1.00      0.99      0.99        83
         ``       1.00      1.00      1.00       251

avg / total       0.95      0.95      0.95     32853
```
Section 23 of the WSJ corpus is usually reserved for testing. The tagged output file for the test set is `output/wsj_23.pos`. The original corpus file is `WSJ/WSJ_23.words`. Note that the original `.pos` file for the test set has not yet been released.
To achieve optimal results on the test split, the additive smoothing parameter is set to `alpha = 0.001`. The training file is set to `WSJ/WSJ_02-21.pos`.