
Primary language: Python. License: GNU General Public License v2.0 (GPL-2.0).

Part-of-Speech Tagging System with Viterbi Algorithm

This program was written and tested with Python 3.11.7. Run it as follows:

python viterbi.py <train_file> <test_file> <output_file>

The algorithm uses a bigram model to calculate the transition probabilities: the probability of each tag is conditioned on the tag immediately before it.
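As an illustration of how a bigram transition model feeds into Viterbi decoding, here is a minimal sketch. It is not the code in viterbi.py: the function names, the `<s>` and `</s>` boundary markers, and the add-alpha smoothing are all assumptions made for the example, and it expects a non-empty list of words.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def train(sentences):
    """Count tag bigrams and (tag, word) pairs from lists of (word, tag)."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
        trans[prev][END] += 1
    return trans, emit


def log_prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log probability (smoothing is an assumption here)."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))


def viterbi(words, trans, emit):
    """Return the most likely tag sequence for a non-empty list of words."""
    tags = list(emit)
    n_tags = len(tags) + 1  # +1 for END as a possible successor
    n_words = len({w for c in emit.values() for w in c}) + 1  # +1 for unknowns
    # best[i][t] = (log score of the best path ending in tag t at i, backpointer)
    best = [{t: (log_prob(trans[START], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)
    # Pick the final tag, including the transition into END, then backtrack.
    last = max(tags, key=lambda t: best[-1][t][0] + log_prob(trans[t], END, n_tags))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


toy = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
trans, emit = train(toy)
print(viterbi(["the", "cat", "barks"], trans, emit))
```

Working in log space avoids numeric underflow on long sentences. A real tagger would likely need a more careful treatment of unknown words than the uniform smoothing used here.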

Task Description

The most standard breakdown for training and test purposes of the Penn Treebank Corpus is:

Sections 02-21 -- Training

Section 23 -- Test

Section 24 -- Development

The other sections (00, 01, 22) are typically not used, although section 00 is sometimes treated informally as training or development material (many papers cite examples from 00 files).

There are 2 possible versions of each file:

  1. file.pos -- two columns separated by a tab: the first column is the token and the second column is the POS tag. Blank lines separate sentences.

    This is the format of training files, system output, and development or test files used for scoring purposes.

  2. file.words -- one token per line, with blank lines between sentences. This is the input format for a tagging program.
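The two formats above can be read with a single small helper. The following is a sketch only; `read_sentences` and its `tagged` parameter are hypothetical names, not part of the distributed files.

```python
def read_sentences(lines, tagged=True):
    """Group lines into sentences, using blank lines as separators.

    With tagged=True each line is 'token<TAB>tag' (.pos format) and the
    result holds (token, tag) tuples; otherwise each line is a bare token
    (.words format).
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        if tagged:
            token, tag = line.split("\t")
            current.append((token, tag))
        else:
            current.append(line)
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


pos_lines = ["The\tDT\n", "dog\tNN\n", "\n", "Runs\tVBZ\n"]
print(read_sentences(pos_lines))
```

Because the function accepts any iterable of lines, an open file handle such as `open("WSJ_02-21.pos", encoding="utf-8")` can be passed directly.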

For HW4, we are distributing the following files:

WSJ_02-21.pos -- to use as the training corpus

WSJ_24.words -- to use as your development set (for testing your system)

WSJ_24.pos -- to use to check how well your system is doing

WSJ_23.words -- to run your system on. You should produce a file in the .pos format as your output and submit it as per the submission instructions to be announced.

score.py -- this is a scorer which you can use on your development corpus. The scoring command is:

python3 score.py WSJ_24.pos WSJ_24_sys.pos

assuming that your system output is called WSJ_24_sys.pos

This will give you an accuracy score. For further debugging and tuning, I suggest using the UNIX diff utility.
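To make the accuracy figure concrete, here is a rough sketch of a token-level accuracy computation over two .pos files. It is not the distributed score.py; `tagging_accuracy` is a hypothetical name, and the sketch assumes both files are aligned line by line with the same tokens and blank lines.

```python
def tagging_accuracy(gold_lines, system_lines):
    """Fraction of tokens whose system tag matches the gold tag.

    Assumes the two inputs are aligned line by line (same tokens, same
    blank lines); blank separator lines are skipped.
    """
    correct = total = 0
    for gold, system in zip(gold_lines, system_lines):
        gold = gold.rstrip("\r\n")
        system = system.rstrip("\r\n")
        if not gold.strip():
            continue
        gold_tag = gold.split("\t")[1]
        parts = system.split("\t")
        # A malformed or blank system line counts as an error.
        system_tag = parts[1] if len(parts) > 1 else None
        total += 1
        correct += gold_tag == system_tag
    return correct / total if total else 0.0


gold = ["a\tDT", "b\tNN", "", "c\tVB"]
system = ["a\tDT", "b\tJJ", "", "c\tVB"]
print(tagging_accuracy(gold, system))
```

Lines that differ between the two files are the same ones a line-based diff would report, which makes the two approaches complementary when debugging.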