Training a Hidden Markov Model to identify country references (Capital cities & world leaders too)
This is an old-school model made easy by NLTK, labeling tokens (~words) as either relevant to countries or not. The intent here is something like NER-light: identifying country references without a full named-entity pipeline.
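Here's a minimal sketch of that setup, assuming NLTK's `HiddenMarkovModelTrainer` and hypothetical tag names of my own choosing:

```python
from nltk.probability import LidstoneProbDist
from nltk.tag.hmm import HiddenMarkovModelTrainer

# Hypothetical training data from the heuristic annotator: each title is a
# list of (token, tag) pairs, with "CTRY" marking country references.
train_titles = [
    [("Jordan", "CTRY"), ("signs", "OTH"), ("trade", "OTH"), ("deal", "OTH")],
    [("Stocks", "OTH"), ("rally", "OTH"), ("on", "OTH"), ("earnings", "OTH")],
]

# Lidstone smoothing keeps unseen tokens from zeroing out the probabilities.
trainer = HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(
    train_titles,
    estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins),
)

print(hmm_tagger.tag(["Jordan", "hosts", "summit"]))
```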
`OTH` is just a dummy tag to denote "other", i.e., tokens that aren't country references.
Accuracy is a fine metric, but it doesn't tell the whole story.
- For starters, it's only a reflection of the training data. If we have flawed training data (which we do, 'cuz we're programmatically tagging training data without human intervention), the model will faithfully learn those mistakes.
- There's a class imbalance problem here: most words are not country references, and many news articles don't reference a country at all. That means a model that throws the "Other" tag on every word will still score very high accuracy.
- To better understand where the model differs from the heuristic-based annotation, I'm checking a random sample of 500 article titles from my news article corpus to see whether the two agree (a sketch of this check follows the list).
- On average, the annotations and HMM tags differ on somewhere between 110 and 150 of the 500 titles (~22-30%) across training runs. Sometimes this is good (the model generalizes), sometimes this is bad (the model stubbornly throws errant tags).
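Roughly what that spot check looks like, as a sketch: `hmm_tagger` comes from the training snippet above, and `annotated_titles` is a stand-in for my heuristically annotated corpus.

```python
import random

# Hypothetical stand-in for the annotated corpus: lists of (token, tag) pairs.
annotated_titles = [
    [("Jordan", "CTRY"), ("hosts", "OTH"), ("summit", "OTH")],
    [("Jim", "OTH"), ("Jordan", "OTH"), ("speaks", "OTH")],
]

# Sample up to 500 titles and count those where the HMM's tags differ from
# the heuristic annotation anywhere in the title.
sample = random.sample(annotated_titles, min(500, len(annotated_titles)))
mismatches = sum(
    1 for title in sample
    if hmm_tagger.tag([tok for tok, _ in title]) != title
)
print(f"{mismatches} of {len(sample)} sampled titles disagree")
```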
- Dual meaning (polysemy) is a problem. This is clear in the example above: `Jordan` is a country, but `Jim Jordan` shouldn't be. This is somewhat to be expected from a 20+ year-old model architecture, but I could probably clean it up a little with smarter annotation rules (one such rule is sketched after this list).
- The model gets a little tag-happy with the beginnings of sentences. I originally thought the model was overfitting on article titles that begin with "Country: Words words words", but the errant pattern remains (albeit in a muted form) even after adding validation and cleaning steps.
- As the snip above shows, this pattern accounts for the preponderance of annotation/HMM mismatches (~109 out of 115 mismatches)
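One smarter annotation rule, sketched under the assumption that a first-name gazetteer is available (NLTK's `names` corpus could supply one): skip a country match when the preceding token looks like a person's first name.

```python
# Hypothetical rule for the "Jim Jordan" problem: don't annotate a country
# name as a country reference when it directly follows a likely first name.
FIRST_NAMES = {"Jim", "Michael", "Sarah"}  # stand-in; NLTK's names corpus could supply this
COUNTRY_NAMES = {"Jordan", "France"}       # stand-in gazetteer

def is_country_ref(tokens, i):
    if tokens[i] not in COUNTRY_NAMES:
        return False
    # "Jordan" in "Jim Jordan" is a person, not a country.
    return not (i > 0 and tokens[i - 1] in FIRST_NAMES)

tokens = ["Jim", "Jordan", "criticized", "Jordan"]
print([is_country_ref(tokens, i) for i in range(len(tokens))])
# -> [False, False, False, True]
```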
- Consider adding more tag classes (e.g., separate tags for countries, capital cities, and world leaders), as they may give the model additional context
- Look into ways to use Part of Speech (POS) tags to reduce false positives in the training data. Country references should often be special cases of proper nouns (sketched below).
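A minimal sketch of that POS filter, assuming a hypothetical case-insensitive gazetteer and NLTK's off-the-shelf POS tagger (which needs the `averaged_perceptron_tagger` data downloaded):

```python
import nltk

# Stand-in gazetteer; assume the heuristic annotator matches case-insensitively.
COUNTRY_NAMES = {"jordan", "china", "turkey"}

def annotate(tokens):
    """Keep gazetteer hits only when the POS tagger calls them proper nouns."""
    tagged = nltk.pos_tag(tokens)  # NNP/NNPS mark (plural) proper nouns
    return [
        (tok, "CTRY" if tok.lower() in COUNTRY_NAMES and pos.startswith("NNP") else "OTH")
        for tok, pos in tagged
    ]

# "turkey" and "china" here are common nouns (NN), so the NNP check
# should keep them out of the training data as country references.
print(annotate(["Serve", "the", "turkey", "on", "china", "plates"]))
```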