This program is written and tested with Python 3.11.7
python viterbi.py <train_file> <test_file> <output_file>
The algorithm uses a bigram model to calculate the transition probabilities.
The most standard breakdown for training and test purposes of the Penn Treebank Corpus is:
Sections 02-21 Training Section 23 Test Section 24 Development
The other sections (00, 01, 22) are typically not used, although section 00 has a training/development feel to it (many papers cite examples from 00 files).
There are 2 possible versions of each file:
-
file.pos -- there are two columns separated by a tab: 1st column: token 2nd column: POS tag Blank lines separate sentences.
This is the format of training files, system output, and development or test files used for scoring purposes.
-
file.words -- one token per line, with blank lines between sentences. Format of an input file for a tagging program.
For HW4, we are distributing the following files:
WSJ_02-21.pos -- to use as the training corpus
WSJ_24.words -- to use as your development set (for testing your system)
WSJ_24.pos -- to use to check how well your system is doing
WSJ_23.words -- to run your system on. You should produce a file in the .pos format as your output and submit it as per the submission instructions to be announced.
score.py -- this is a scorer which you can use on your development corpus. The scoring command is:
python3 score.py WSJ_24.pos WSJ_24_sys.pos
assuming that your system output is called WSJ_24_sys.pos
This will give you an accuracy score. For further debugging and tuning, I suggest using the UNIX diff utility.