HMM-Part-of-Speech-Tagger: A Python repository from lesliemanrique1

#Homework Number 4

###Includes Two Files:

viterbi_run.py - runs the Viterbi algorithm

viterbi.py - library of functions

###To run this file:

python viterbi_run.py trainingfile developmentfile runfile

trainingfile - file used as the training corpus (WSJ_02-21.pos)

developmentfile - file used as the development set (WSJ_24.pos)

runfile - file to run the system on

This program produces a file called Leslie_Manrique_WSJ_23.pos.

For this assignment, I implemented a bigram system, as shown in Chapter 5 of the Jurafsky and Martin book. The training file and development file are merged to create a larger file, which is written out as merged.txt.
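As a rough illustration of what training a bigram tagger involves, the sketch below counts tag-to-tag transitions and tag-to-word emissions and normalizes them into probabilities. It is not the code from viterbi.py: the function names `count_bigrams` and `normalize` are hypothetical, and it assumes each non-blank line of a .pos file is `word<TAB>tag` with a blank line between sentences.

```python
from collections import defaultdict

START = 'S'  # this README uses 'S' to signify the start of a sentence


def count_bigrams(lines):
    """Count tag-to-tag transitions and tag-to-word emissions.

    Assumes each non-blank line is "word<TAB>tag" and that a blank
    line separates sentences (an assumption about the .pos format).
    """
    transitions = defaultdict(lambda: defaultdict(int))
    emissions = defaultdict(lambda: defaultdict(int))
    prev = START
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            prev = START  # sentence boundary: restart from the start state
            continue
        word, tag = line.split('\t')
        transitions[prev][tag] += 1
        emissions[tag][word] += 1
        prev = tag
    return transitions, emissions


def normalize(counts):
    """Turn nested counts into conditional probabilities."""
    probs = {}
    for given, row in counts.items():
        total = float(sum(row.values()))
        probs[given] = {key: value / total for key, value in row.items()}
    return probs


# Tiny made-up corpus, only to show the shape of the tables.
sample = ["The\tDT\n", "dog\tNN\n", "barks\tVBZ\n", "\n",
          "The\tDT\n", "cat\tNN\n"]
transitions, emissions = count_bigrams(sample)
trans_prob = normalize(transitions)
emit_prob = normalize(emissions)
```

Here `trans_prob[previous_tag][tag]` and `emit_prob[tag][word]` are the two tables a bigram Viterbi decoder looks up at every step.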

###OOV Words

The hardest part was handling OOV words. I accomplished this via the following code:

def OOV_tag(word,index,pos_keys):
	#collect every candidate tag whose rule matches the word
	OOV_pos = []

	#if there's a hyphen, add JJ
	if '-' in word:
		if 'JJ' not in OOV_pos:
			OOV_pos.append('JJ')

	#if word ends with able, add JJ
	if word[-4:] == 'able':
		if 'JJ' not in OOV_pos:
			OOV_pos.append('JJ')

	#if word is numerical, add CD
	#(unicode() is Python 2; in Python 3 use word.isnumeric())
	if unicode(word).isnumeric():
		if 'CD' not in OOV_pos:
			OOV_pos.append('CD')
	#isfloat is a helper that is not shown in this excerpt
	if isfloat(word):
		if 'CD' not in OOV_pos:
			OOV_pos.append('CD')

	#if it starts with an uppercase letter, is not at the beginning of the sentence, AND ends with an s, add NNPS
	if word[0].isupper() and index > 0 and word[-1] == 's':
		if 'NNPS' not in OOV_pos:
			OOV_pos.append('NNPS')

	#if it starts with an uppercase letter and is not at the beginning of the sentence, add NNP
	if word[0].isupper() and index > 0:
		if 'NNP' not in OOV_pos:
			OOV_pos.append('NNP')

	#if it ends with an s, add NNS
	if word[-1] == 's':
		if 'NNS' not in OOV_pos:
			OOV_pos.append('NNS')

	#if it ends with ing, add VBG and VB
	if word[-3:] == 'ing':
		if 'VBG' not in OOV_pos:
			OOV_pos.append('VBG')
		if 'VB' not in OOV_pos:
			OOV_pos.append('VB')

	#if it ends with ed, add VBD
	if word[-2:] == 'ed':
		if 'VBD' not in OOV_pos:
			OOV_pos.append('VBD')

	#if it ends with ly, add RB
	if word[-2:] == 'ly':
		if 'RB' not in OOV_pos:
			OOV_pos.append('RB')

	#if it ends with er, add JJR
	if word[-2:] == 'er':
		if 'JJR' not in OOV_pos:
			OOV_pos.append('JJR')

	#if it ends with est, add JJS
	if word[-3:] == 'est':
		if 'JJS' not in OOV_pos:
			OOV_pos.append('JJS')

	#if no rule matched, default to NN
	if len(OOV_pos) == 0:
		OOV_pos.append('NN')
	return OOV_pos

By default, an OOV word is assumed to be a noun (NN). Each if statement is checked in turn, and whenever its condition is true, the corresponding part of speech is added to the list of candidate parts of speech.

The likelihood for out-of-vocabulary words was automatically set to 1/100,000.
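One way such a constant could plug into the emission lookup is sketched below. This is an assumption about the wiring, not the code in viterbi.py: the function name `emission_likelihood`, its parameters, and the nested-dictionary layout of `emit_prob` are all hypothetical.

```python
OOV_LIKELIHOOD = 1.0 / 100000  # the constant likelihood for unseen words


def emission_likelihood(word, tag, emit_prob, vocabulary, oov_tags):
    """Return P(word | tag), with a constant fallback for unseen words.

    emit_prob  -- dict: tag -> dict: word -> probability (hypothetical layout)
    vocabulary -- set of words seen in training
    oov_tags   -- candidate tags for this word, e.g. the list OOV_tag returns
    """
    if word in vocabulary:
        return emit_prob.get(tag, {}).get(word, 0.0)
    # Unseen word: only the candidate tags receive the constant likelihood.
    return OOV_LIKELIHOOD if tag in oov_tags else 0.0


# Made-up tables, only to show the three cases.
emit_prob = {'NN': {'dog': 0.5, 'cat': 0.5}}
vocabulary = {'dog', 'cat'}
known = emission_likelihood('dog', 'NN', emit_prob, vocabulary, [])
unseen_candidate = emission_likelihood('quickly', 'RB', emit_prob, vocabulary, ['RB'])
unseen_other = emission_likelihood('quickly', 'NN', emit_prob, vocabulary, ['RB'])
```

Restricting the fallback to the candidate tags is a design choice of this sketch: it keeps the decoder from considering every tag for every unknown word.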

###Problems

I ran into problems with this project. After finishing my code, I found that some transitions were missing from my transition table. As a result, when a part of speech was selected, there was a gap in my Viterbi lookup table. When the path was then traced back, it defaulted to 0, which caused the words to be tagged 'S', the tag I used to signify the start of a sentence.
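A common way to avoid such gaps is to give unseen transitions a small floor probability instead of leaving them out of the table. The sketch below shows that idea in a minimal bigram Viterbi decoder. It is not the fix used in this repository: the `FLOOR` value, the function signature, and the nested-dictionary tables are all assumptions made for illustration.

```python
FLOOR = 1e-10  # hypothetical small probability for unseen transitions


def viterbi(words, tags, trans_prob, emit_prob, start='S'):
    """Return the most likely tag sequence for words under a bigram HMM."""
    if not words:
        return []
    # best[i][t] = (score of the best path ending in tag t at word i, backpointer)
    best = [{}]
    for t in tags:
        score = (trans_prob.get(start, {}).get(t, FLOOR)
                 * emit_prob.get(t, {}).get(words[0], 0.0))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = emit_prob.get(t, {}).get(words[i], 0.0)
            # A missing transition falls back to FLOOR, so no cell is left empty.
            best[i][t] = max(
                (best[i - 1][p][0] * trans_prob.get(p, {}).get(t, FLOOR) * emit, p)
                for p in tags)
    # Trace the backpointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    path.reverse()
    return path


# Made-up tables in which the NN -> VBZ transition was never observed.
tags = ['DT', 'NN', 'VBZ']
trans_prob = {'S': {'DT': 1.0}, 'DT': {'NN': 1.0}}
emit_prob = {'DT': {'the': 1.0}, 'NN': {'dog': 1.0}, 'VBZ': {'barks': 1.0}}
path = viterbi(['the', 'dog', 'barks'], tags, trans_prob, emit_prob)
```

Because every cell always has a backpointer to a real tag, the traced path can never fall back to the start symbol. For full-length sentences, multiplying raw probabilities can underflow, so summing log probabilities is the usual variant.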

The algorithm also takes a while to run, usually 60 to 80 seconds.

All in all, in my tests using score.py, I achieved over 95% accuracy when tagging the file WSJ_24.words.
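The contents of score.py are not shown here. Assuming the accuracy is the percentage of tokens whose predicted tag matches the gold tag, the computation could be sketched as follows (the function name `tag_accuracy` is hypothetical):

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Percentage of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(1 for gold, pred in zip(gold_tags, predicted_tags)
                  if gold == pred)
    return 100.0 * correct / len(gold_tags)


# Made-up example: three of four tags agree.
accuracy = tag_accuracy(['DT', 'NN', 'VBZ', 'RB'], ['DT', 'NN', 'VBZ', 'JJ'])
```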
