PoS tagging consists of assigning a tag to each word of a corpus. The choice of tagset depends on the language and on the application. The input is a sequence of words and the tagset to use; the output is the most appropriate tag for each word. For a more detailed introduction to this kind of PoS tagger, I recommend this reading: https://www.mygreatlearning.com/blog/pos-tagging/.
The datasets were taken from the Universal Dependencies website: https://universaldependencies.org/
In this project I tried different ways of processing the training set, varying how words are stored, how probabilities are stored, and how unknown words are smoothed.
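The Universal Dependencies treebanks are distributed as CoNLL-U files, where each token line has ten tab-separated columns and the word form and universal PoS tag sit in the second and fourth columns. The following is a minimal sketch of how (word, tag) pairs could be read from such a file; the function name `read_conllu` is hypothetical and is not taken from this project's code.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (word form, UPOS tag) pairs


def read_conllu(lines: Iterable[str]) -> List[Sentence]:
    """Collect (FORM, UPOS) pairs, sentence by sentence, from CoNLL-U lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append((cols[1], cols[3]))  # FORM, UPOS
    if current:
        sentences.append(current)
    return sentences


# Small in-memory sample; a real file would be opened with open(path, encoding="utf-8").
sample = [
    "# text = The dog barks.",
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
]
parsed = read_conllu(sample)
```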
Word storing methods (--storing-data):
- 1: all words are stored in lowercase;
- 2: all words are stored exactly as they appear in the training set, with no preprocessing.
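The difference between the two word storing methods can be illustrated with a small counting sketch. The function `count_emissions` and its `lowercase` flag are hypothetical names chosen for this example; they only show how lowercasing merges entries that the as-read method keeps apart.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Tuple


def count_emissions(
    pairs: Iterable[Tuple[str, str]], lowercase: bool
) -> DefaultDict[str, Counter]:
    """Count how often each stored word form occurs with each tag."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for word, tag in pairs:
        key = word.lower() if lowercase else word  # method 1 vs. method 2
        counts[key][tag] += 1
    return counts


pairs = [("The", "DET"), ("the", "DET"), ("Apple", "PROPN"), ("apple", "NOUN")]
lowered = count_emissions(pairs, lowercase=True)    # method 1
as_read = count_emissions(pairs, lowercase=False)   # method 2
```

With lowercasing, "The" and "the" share one entry, but "Apple" (PROPN) and "apple" (NOUN) also collapse into a single ambiguous entry; storing words as read keeps all four forms distinct.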
Probability storing methods (--storing-prob):
- 1: P(wi | ti) is set to 0 for a word wi never encountered with the tag ti in the training set. The same holds for the probabilities P(ti | ti-1) of tag sequences never encountered;
- 2: a word encountered with only one tag ti ∈ I = {NOUN, ADJ, VERB, ADV, PROPN} takes ti with a probability of 99% and each other tag tj ∈ I \ {ti} with a probability of 0.25%. A similar rule applies to the probabilities P(ti | ti-1) of tag sequences.
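The two probability storing methods can be sketched as follows. This is one literal reading of the description, not the project's actual code: the function name `emission` is hypothetical, and the use of the relative frequency count(w, t) / count(t) for seen pairs is an assumption, since the text only specifies what happens in the unseen and single-tag cases.

```python
from collections import Counter
from typing import Dict

OPEN_CLASS = ("NOUN", "ADJ", "VERB", "ADV", "PROPN")  # the set I from the text


def emission(
    word_tag_counts: Dict[str, Counter],
    tag_totals: Counter,
    word: str,
    tag: str,
    method: int,
) -> float:
    """Return the stored value for P(word | tag) under method 1 or 2."""
    seen = word_tag_counts.get(word, Counter())
    if method == 2 and len(seen) == 1:
        (only_tag,) = seen  # the single tag this word was seen with
        if only_tag in OPEN_CLASS and tag in OPEN_CLASS:
            # 99% for the observed tag, 0.25% for each of the other four.
            return 0.99 if tag == only_tag else 0.0025
    count = seen.get(tag, 0)
    # Assumed estimate for seen pairs; unseen pairs get exactly 0.
    return count / tag_totals[tag] if count else 0.0


counts = {
    "run": Counter({"VERB": 3}),
    "the": Counter({"DET": 5}),
    "light": Counter({"NOUN": 2, "ADJ": 2}),
}
totals = Counter({"VERB": 3, "DET": 5, "NOUN": 2, "ADJ": 2})

strict = emission(counts, totals, "run", "NOUN", method=1)   # unseen pair
relaxed = emission(counts, totals, "run", "NOUN", method=2)  # small non-zero mass
```

The five values in method 2 sum to 1 (0.99 + 4 × 0.0025), so a word seen with a single open-class tag can still be assigned one of the other open-class tags when the context strongly favours it.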
Smoothing methods for unknown words (--smoothing):
- 1: P(word | PROPN) = 1 and P(word | tag) = 0 for every tag other than PROPN;
- 2: P(word | ti) = P(ti) where
To run this project you have to set the --training-set and --test-set parameters, which indicate the paths of the two datasets. You can also set the --validation-set parameter; the validation set will be added to the training set.
Other parameters that you can set are:
- --storing-data: word storing method;
- --storing-prob: probability storing method;
- --smoothing: smoothing method for unknown words.
Example command:
python3 app.py --training-set=ud-treebanks-v2.3/UD_English-LinES/en_lines-ud-train.conllu --validation-set=ud-treebanks-v2.3/UD_English-LinES/en_lines-ud-dev.conllu --test-set=ud-treebanks-v2.3/UD_English-LinES/en_lines-ud-test.conllu
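For readers who want to see how options like these are typically declared, here is a minimal argparse sketch. The flag names come from this README; everything else is an assumption made for illustration, including the integer values 1 and 2, the default of 1, and the function name `build_parser`. The real app.py may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line options described above (illustrative only)."""
    parser = argparse.ArgumentParser(description="PoS tagger options (sketch)")
    parser.add_argument("--training-set", required=True,
                        help="path of the training dataset")
    parser.add_argument("--test-set", required=True,
                        help="path of the test dataset")
    parser.add_argument("--validation-set", default=None,
                        help="optional dataset added to the training set")
    # Assumed: each method is selected by its number, defaulting to 1.
    for flag, text in (
        ("--storing-data", "word storing method"),
        ("--storing-prob", "probability storing method"),
        ("--smoothing", "smoothing method for unknown words"),
    ):
        parser.add_argument(flag, type=int, choices=(1, 2), default=1, help=text)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--training-set=train.conllu",
    "--test-set=test.conllu",
    "--smoothing=2",
])
```

argparse converts dashes in option names to underscores, so the values are read as `args.training_set`, `args.storing_data`, and so on.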