Urdu-POS-Tagger

Part-of-Speech (POS) tagger for the Urdu language, based on a Hidden Markov Model (HMM) with Kneser-Ney smoothing

This directory contains the implementation of a Hidden Markov Model (HMM) based Part-of-Speech (POS) tagger that uses Kneser-Ney smoothing. All the code is written in Python. The dataset, containing training, validation and test data, is in the same directory.
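To illustrate what Kneser-Ney smoothing does in an HMM tagger, here is a minimal sketch of interpolated Kneser-Ney applied to tag bigrams (transition probabilities). It is not taken from 'kn_pos.py': the function name `kneser_ney_bigram`, the single fixed discount of 0.75 and the bigram-only model are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kneser_ney_bigram(tag_sequence, discount=0.75):
    """Return prob(tag, prev), an interpolated Kneser-Ney bigram estimate.

    Hypothetical helper for illustration; the discount value is an assumption.
    """
    bigrams = Counter(zip(tag_sequence, tag_sequence[1:]))
    context_totals = Counter()    # c(prev): how often prev starts a bigram
    followers = defaultdict(set)  # distinct tags seen after prev
    preceders = defaultdict(set)  # distinct tags seen before tag
    for (prev, tag), count in bigrams.items():
        context_totals[prev] += count
        followers[prev].add(tag)
        preceders[tag].add(prev)
    n_bigram_types = len(bigrams)

    def prob(tag, prev):
        # Continuation probability: in how many distinct contexts does tag appear?
        continuation = len(preceders.get(tag, ())) / n_bigram_types
        total = context_totals[prev]
        if total == 0:
            # Unseen context: fall back to the continuation probability alone.
            return continuation
        discounted = max(bigrams[(prev, tag)] - discount, 0) / total
        # Mass removed by discounting, redistributed via the continuation term.
        backoff_weight = discount * len(followers.get(prev, ())) / total
        return discounted + backoff_weight * continuation

    return prob


transition = kneser_ney_bigram(["NN", "PSP", "NN", "JJ", "NN", "PSP", "NN"])
# A transition never seen in the toy sequence still gets non-zero probability.
print(transition("NN", "NN"))
```

The point of the discount is that every observed bigram gives up a small amount of probability mass, which is then shared among tags according to how many different contexts they follow, so unseen transitions do not zero out a whole tag sequence.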

Use the 'kn_pos.py' file to train the model and tag the test data. It writes a 'tagged_output.txt' file containing the word, tag pairs for the test data in tab-separated form (word tag), one pair per line.

To run the 'kn_pos.py' file, use the following command:
python kn_pos.py path/to/trainfile path/to/testfile

The 'tagged_output.txt' file is written to the directory where the 'kn_pos.py' file is located.
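HMM taggers commonly pick the tag sequence with Viterbi decoding; this README does not say whether 'kn_pos.py' does so. The sketch below shows that general technique under stated assumptions: the function name `viterbi` and the three probability callables (`start_p`, `trans_p`, `emit_p`) are hypothetical and do not come from the repository.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for words.

    Assumed (hypothetical) callables:
      start_p(tag)       -> P(tag at sentence start)
      trans_p(tag, prev) -> P(tag | previous tag)
      emit_p(word, tag)  -> P(word | tag)
    """
    if not words:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (log-score of the best path ending in tag t at position i,
    #               tag chosen at position i - 1 on that path)
    best = [{t: (log(start_p(t)) + log(emit_p(words[0], t)), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log(trans_p(t, p)), p) for p in tags
            )
            column[t] = (score + log(emit_p(word, t)), prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]
```

Working in log space avoids numeric underflow on long sentences, and smoothed probabilities keep the scores finite for transitions that never occurred in training.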

To evaluate the tags generated by the tagger against the correct tags, use the evaluation script. Run the following command:
python evaluation.py tagged_output.txt path/to/validationfile

It will print out the accuracy.
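As a rough idea of what such an accuracy figure means, the sketch below computes token-level accuracy: the share of positions where the predicted tag equals the gold tag. The function name `tagging_accuracy` and the list-of-pairs input are assumptions for illustration, not the evaluation script's actual code.

```python
def tagging_accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag.

    Both arguments are lists of (word, tag) pairs in the same order.
    Hypothetical helper; not the repository's evaluation code.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of pairs")
    if not gold:
        return 0.0
    correct = sum(
        pred_tag == gold_tag
        for (_, pred_tag), (_, gold_tag) in zip(predicted, gold)
    )
    return correct / len(gold)


predicted = [("کے", "PSP"), ("بعد", "NN"), ("اور", "NN")]
gold = [("کے", "PSP"), ("بعد", "NN"), ("اور", "CC")]
print(f"accuracy: {tagging_accuracy(predicted, gold):.2%}")
```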

DATA FORMATS:

Training data should be in tab-separated word, tag format, with one pair per line:

ٹریور NN
ٹینک NN
مختلف JJ
قسم NN
کی PSP
چڑیوں NN
جیسے PRR
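A file in this format can be read with a few lines of Python. The sketch below is an assumption about how one might parse it, not the loader used by 'kn_pos.py'; the name `parse_tagged_lines` is hypothetical, and it splits on any whitespace so that both tab and space separators are accepted.

```python
from collections import Counter


def parse_tagged_lines(lines):
    """Turn 'word<TAB>tag' lines into a list of (word, tag) pairs.

    Hypothetical helper. Blank or malformed lines are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()  # accepts a tab or spaces between word and tag
        if len(parts) != 2:
            continue
        pairs.append((parts[0], parts[1]))
    return pairs


sample = ["مختلف\tJJ\n", "قسم\tNN\n", "\n", "کی\tPSP\n", "چڑیوں\tNN\n"]
pairs = parse_tagged_lines(sample)

# Counts like these are the raw material for HMM emission estimates.
tag_counts = Counter(tag for _, tag in pairs)
emission_counts = Counter(pairs)
print(tag_counts)
```

To read from disk, the same function can be given an open file, for example `parse_tagged_lines(open(path, encoding="utf-8"))`, where `path` is the training file passed on the command line.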

Validation data should be in the same word, tag format:
ابتدائی JJ
نقصان NN
کے PSP
بعد NN
معین NNP
علی NNP
اور CC
مورگن NNP
نے PSP

Test data should contain one word per line, without tags:

ابتدائی
نقصان
کے
بعد
معین
علی
اور