A Java implementation of different probabilistic part-of-speech (PoS) tagging techniques.
The PoS tagging techniques implemented are the following:
- BaseLinePoSTagger: a simple PoS tagger that assigns to each word the most common tag observed for that word.
- ViterbiPoSTagger: the Viterbi algorithm is a dynamic programming technique that assumes a Hidden Markov Model (HMM) in which the hidden states are the correct sequence of part-of-speech tags, while the observed output, which depends on the state, is the sequence of words that compose the sentence. This implementation of the Viterbi algorithm uses bigrams, so it assumes that each state transition depends only on the previous state (a minimal sketch of the decoding step is shown after this list).
- TrigramsPoSTagger: an implementation of the Viterbi algorithm that uses trigrams instead of bigrams, so it assumes that each state transition depends on the previous two states. This implementation may suffer from data sparseness, which is mitigated by the deleted interpolation algorithm.
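
To make the bigram case more concrete, the sketch below shows one way Viterbi decoding over an HMM of tags can be written in Java. It is only an illustration: the class name, the probability tables, and the small floor value used for unseen events are assumptions made for this example and do not necessarily match the classes in this repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of bigram Viterbi decoding over an HMM of PoS tags.
// transition.get(prev).get(cur) ~ P(cur tag | prev tag)
// emission.get(tag).get(word)   ~ P(word | tag)
public class ViterbiSketch {

    static List<String> decode(List<String> words,
                               List<String> tags,
                               Map<String, Map<String, Double>> transition,
                               Map<String, Map<String, Double>> emission,
                               Map<String, Double> initial) {
        if (words.isEmpty()) {
            return new ArrayList<>();
        }
        int n = words.size();
        // viterbi.get(i) maps each tag to the best log-probability of a tag
        // sequence ending in that tag at position i.
        List<Map<String, Double>> viterbi = new ArrayList<>();
        List<Map<String, String>> backPointer = new ArrayList<>();

        // Initialization: P(tag) * P(word_0 | tag), in log space.
        Map<String, Double> first = new HashMap<>();
        for (String tag : tags) {
            first.put(tag, Math.log(initial.getOrDefault(tag, 1e-12))
                    + Math.log(prob(emission, tag, words.get(0))));
        }
        viterbi.add(first);
        backPointer.add(new HashMap<>());

        // Recursion: each transition depends only on the previous tag (bigram assumption).
        for (int i = 1; i < n; i++) {
            Map<String, Double> current = new HashMap<>();
            Map<String, String> pointers = new HashMap<>();
            for (String tag : tags) {
                double best = Double.NEGATIVE_INFINITY;
                String bestPrev = null;
                for (String prev : tags) {
                    double score = viterbi.get(i - 1).get(prev)
                            + Math.log(prob(transition, prev, tag))
                            + Math.log(prob(emission, tag, words.get(i)));
                    if (score > best) {
                        best = score;
                        bestPrev = prev;
                    }
                }
                current.put(tag, best);
                pointers.put(tag, bestPrev);
            }
            viterbi.add(current);
            backPointer.add(pointers);
        }

        // Termination: pick the best final tag and follow the back pointers.
        String bestTag = null;
        double best = Double.NEGATIVE_INFINITY;
        for (String tag : tags) {
            if (viterbi.get(n - 1).get(tag) > best) {
                best = viterbi.get(n - 1).get(tag);
                bestTag = tag;
            }
        }
        List<String> result = new ArrayList<>();
        for (int i = n - 1; i >= 0; i--) {
            result.add(0, bestTag);
            bestTag = backPointer.get(i).get(bestTag);
        }
        return result;
    }

    // Looks up a conditional probability, falling back to a tiny constant
    // (a stand-in for the smoothers described below) for unseen events.
    static double prob(Map<String, Map<String, Double>> table, String given, String event) {
        return table.getOrDefault(given, Map.of()).getOrDefault(event, 1e-12);
    }
}
```

Working in log space avoids numerical underflow when many small probabilities would otherwise be multiplied together.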
Furthermore, in order to improve the overall tagging performance, the following normalization techniques have been implemented (a minimal sketch follows the list):
- CapitalizeNormalizer: each word is converted to upper case. For example, both Chair and chair become CHAIR.
- LemmaNormalizer: each word is reduced to its lemma. For example, go, goes, going, went, and gone all become go.
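
As an illustration of how such a normalizer could be applied before tagging, here is a minimal sketch; the Normalizer interface and its method name are assumptions made for this example rather than the repository's actual API. LemmaNormalizer is omitted because it would additionally need a lemmatization dictionary.

```java
import java.util.Locale;

// Illustrative interface for a word normalizer applied before tagging.
interface Normalizer {
    String normalize(String word);
}

// Upper-cases every word, so that "Chair" and "chair" are treated alike.
class CapitalizeNormalizer implements Normalizer {
    @Override
    public String normalize(String word) {
        return word.toUpperCase(Locale.ROOT);
    }
}
```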
Finally, since datasets cannot represent all possible situations, different smoothing techniques are provided to estimate the probability that an unknown word should be assigned a given tag. Some of the implemented smoothing techniques are the following (a sketch of the two simplest ones follows the list):
- OneOverNSmoother: given a training dataset that contains nTags tags, this smoother assigns a probability of 1 / nTags to each tag.
- NounSmoother: this smoother takes advantage of the fact that NOUN is the most common tag. Given a training dataset that contains nTags tags, it assigns a probability of (nTags - 1) / nTags to the NOUN tag and 1 / nTags to every other tag.
- MorphItSmoother: this smoother gives a more accurate probability by looking at how frequent a tag is for the given word in the Morph-it resource.
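
The first two smoothers are simple enough to sketch directly. The Smoother interface below and its method signature are assumptions made for illustration; only the probability formulas come from the descriptions above.

```java
// Illustrative interface for a smoother that estimates P(tag) for an unknown word,
// where nTags is the number of distinct tags in the training dataset.
interface Smoother {
    double probability(String word, String tag, int nTags);
}

// Assigns the same probability, 1 / nTags, to every tag.
class OneOverNSmoother implements Smoother {
    @Override
    public double probability(String word, String tag, int nTags) {
        return 1.0 / nTags;
    }
}

// Favors NOUN, the most common tag: (nTags - 1) / nTags for NOUN, 1 / nTags otherwise.
class NounSmoother implements Smoother {
    @Override
    public double probability(String word, String tag, int nTags) {
        return tag.equals("NOUN") ? (nTags - 1.0) / nTags : 1.0 / nTags;
    }
}
```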