Training a Hidden Markov Model to identify country references (Capital cities & world leaders too)
This is an old-school model made easy by NLTK, labeling tokens (~words) as either relevant to countries or not. The intent here is something like NER-light: identifying country references without a full named-entity pipeline.
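Here's a minimal sketch of that setup, assuming NLTK's `HiddenMarkovModelTrainer` and hypothetical tag names of my own choosing:

```python
from nltk.probability import LidstoneProbDist
from nltk.tag.hmm import HiddenMarkovModelTrainer

# Hypothetical training data from the heuristic annotator: each title is a
# list of (token, tag) pairs, with "CTRY" marking country references.
train_titles = [
    [("Jordan", "CTRY"), ("signs", "OTH"), ("trade", "OTH"), ("deal", "OTH")],
    [("Stocks", "OTH"), ("rally", "OTH"), ("on", "OTH"), ("earnings", "OTH")],
]

# Lidstone smoothing keeps unseen tokens from zeroing out the probabilities.
trainer = HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(
    train_titles,
    estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins),
)

print(hmm_tagger.tag(["Jordan", "hosts", "summit"]))
```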
`OTH` is just a dummy tag to denote "other", i.e., tokens that aren't country references.
Accuracy is a fine metric, but it doesn't tell the whole story.
- For starters, it's only a reflection of the training data. If we have flawed training data (which we do, 'cuz we're programmatically tagging training data without human intervention), the model will faithfully learn those mistakes.
- There's a class imbalance problem here: most words are not country references, and many news articles don't reference a country at all. That means a model that throws the "Other" tag on every word will still score very high accuracy.
- To better understand where the model differs from the heuristic-based annotation, I'm checking a random sample of 500 article titles from my news article corpus to see whether the two agree (a sketch of this check follows the list).
- On average, the annotations and HMM tags differ on somewhere between 110 and 150 of the 500 titles (~22-30%) across training runs. Sometimes this is good (the model generalizes), sometimes this is bad (the model stubbornly throws errant tags).
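Roughly what that spot check looks like, as a sketch: `hmm_tagger` comes from the training snippet above, and `annotated_titles` is a stand-in for my heuristically annotated corpus.

```python
import random

# Hypothetical stand-in for the annotated corpus: lists of (token, tag) pairs.
annotated_titles = [
    [("Jordan", "CTRY"), ("hosts", "OTH"), ("summit", "OTH")],
    [("Jim", "OTH"), ("Jordan", "OTH"), ("speaks", "OTH")],
]

# Sample up to 500 titles and count those where the HMM's tags differ from
# the heuristic annotation anywhere in the title.
sample = random.sample(annotated_titles, min(500, len(annotated_titles)))
mismatches = sum(
    1 for title in sample
    if hmm_tagger.tag([tok for tok, _ in title]) != title
)
print(f"{mismatches} of {len(sample)} sampled titles disagree")
```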
- Dual meaning (polysemy) is a problem. This is clear in the example above: `Jordan` is a country, but `Jim Jordan` shouldn't be. This is somewhat to be expected from a 20+ year-old model architecture, but I could probably clean it up a little with smarter annotation rules (one such rule is sketched after this list).
- The model gets a little tag-happy with the beginnings of sentences. I originally thought the model was overfitting on article titles that begin with "Country: Words words words", but the errant pattern remains (albeit in a muted form) even after adding validation and cleaning steps.
- As the snip above shows, this pattern accounts for the preponderance of annotation/HMM mismatches (~109 out of 115 mismatches)
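One smarter annotation rule, sketched under the assumption that a first-name gazetteer is available (NLTK's `names` corpus could supply one): skip a country match when the preceding token looks like a person's first name.

```python
# Hypothetical rule for the "Jim Jordan" problem: don't annotate a country
# name as a country reference when it directly follows a likely first name.
FIRST_NAMES = {"Jim", "Michael", "Sarah"}  # stand-in; NLTK's names corpus could supply this
COUNTRY_NAMES = {"Jordan", "France"}       # stand-in gazetteer

def is_country_ref(tokens, i):
    if tokens[i] not in COUNTRY_NAMES:
        return False
    # "Jordan" in "Jim Jordan" is a person, not a country.
    return not (i > 0 and tokens[i - 1] in FIRST_NAMES)

tokens = ["Jim", "Jordan", "criticized", "Jordan"]
print([is_country_ref(tokens, i) for i in range(len(tokens))])
# -> [False, False, False, True]
```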
- Consider adding more tag classes (e.g., separate tags for countries, capital cities, and world leaders), as they may give the model additional context
- Look into ways to use Part of Speech (POS) tags to reduce false positives in the training data. Country references should often be special cases of proper nouns (sketched below).
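A minimal sketch of that POS filter, assuming a hypothetical case-insensitive gazetteer and NLTK's off-the-shelf POS tagger (which needs the `averaged_perceptron_tagger` data downloaded):

```python
import nltk

# Stand-in gazetteer; assume the heuristic annotator matches case-insensitively.
COUNTRY_NAMES = {"jordan", "china", "turkey"}

def annotate(tokens):
    """Keep gazetteer hits only when the POS tagger calls them proper nouns."""
    tagged = nltk.pos_tag(tokens)  # NNP/NNPS mark (plural) proper nouns
    return [
        (tok, "CTRY" if tok.lower() in COUNTRY_NAMES and pos.startswith("NNP") else "OTH")
        for tok, pos in tagged
    ]

# "turkey" and "china" here are common nouns (NN), so the NNP check
# should keep them out of the training data as country references.
print(annotate(["Serve", "the", "turkey", "on", "china", "plates"]))
```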