Text editor with English language syntax highlighting

CS 51 Final Project: Part-of-Speech Tagger and Highlighter
Contributors: Billy Janitsch, Jenny Liu, Yuechen Zhao, Joy Zheng

----- COMPILATION AND RUNTIME INSTRUCTIONS -----
To build the code, go to the top level of the repository and run:
javac -classpath . *.java
NOTE: The code uses multi-catch exception handling, so JDK 7
(i.e. 1.7) or higher is required.
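
For reference, the multi-catch syntax behind the JDK requirement looks roughly
like the sketch below. MultiCatchDemo and parseOrDefault are hypothetical names
used only for illustration; they are not part of this project.

```java
// Hypothetical illustration only; not taken from the project's source.
class MultiCatchDemo {
    // Parses text as an int, returning fallback if text is null or malformed.
    static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException | NullPointerException e) {
            // A single handler for two exception types (multi-catch).
            // A pre-Java 7 compiler rejects the "|" in the catch clause.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(" 42 ", -1));
        System.out.println(parseOrDefault("forty-two", -1));
        System.out.println(parseOrDefault(null, -1));
    }
}
```

If javac reports a syntax error on a catch clause of this shape, the JDK in
use is most likely older than version 7.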

To execute with the GUI, run:
java POSTaggerApp
(NOTE: The tagger only tags complete, i.e. period-terminated, sentences. 
Trailing words will not be tagged.)
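
To make the period rule concrete, here is a minimal sketch of which part of an
input would count as complete sentences. It assumes the cutoff is simply the
last period in the text; CompleteSentences and taggablePrefix are hypothetical
names, and the tagger's actual logic may differ.

```java
// Hypothetical illustration only; the real tagger may split sentences differently.
class CompleteSentences {
    // Returns the text up to and including the last period,
    // or the empty string if the text contains no period.
    static String taggablePrefix(String text) {
        int lastPeriod = text.lastIndexOf('.');
        return lastPeriod < 0 ? "" : text.substring(0, lastPeriod + 1);
    }

    public static void main(String[] args) {
        // The trailing words "The cat" have no closing period and are dropped.
        System.out.println(taggablePrefix("The dog barks. The cat"));
    }
}
```

In practice, end the text with a period if the last words should be tagged.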

To test accuracy using the current datafile, tagset, and simplified tagset
in the folder, run:
java POSTaggerApp test <directoryname>

To view parts of speech without the GUI, run:
java POSTaggerApp view <and then some text here that you want to see tagged.>
(NOTE: the text to be tagged must be period-terminated.)

To create a datafile from the command line, run:
java POSTaggerApp compute <tagset location> <simple tagset location> <corpus directory> <datafile save location>

----- DEBUGGING -----
The corpus directory must NOT contain any non-corpus files. If it contains any files other than corpus files, you will be prompted for a new corpus directory.

Check for hidden files in your corpus directory. To do so, add the -a flag to ls in the terminal. On Mac OS in particular, Finder tends to create a file named ".DS_Store" in the corpus directory. If this file is present, remove it from the terminal.
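
The same check can be done from Java. The sketch below lists entries whose
names start with a dot, which is an assumption about what counts as "hidden"
here; HiddenFileCheck and hiddenEntries are hypothetical names, not part of
this project.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the project's source.
class HiddenFileCheck {
    // Returns the names of entries in dir that start with a dot (e.g. ".DS_Store").
    static List<String> hiddenEntries(File dir) {
        List<String> hidden = new ArrayList<String>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return hidden;
        }
        for (File entry : entries) {
            if (entry.getName().startsWith(".")) {
                hidden.add(entry.getName());
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Defaults to "corpus" when no directory is given on the command line.
        File dir = new File(args.length > 0 ? args[0] : "corpus");
        for (String name : hiddenEntries(dir)) {
            System.out.println(name);
        }
    }
}
```

Any name it prints is a file to remove from the corpus directory before
creating a datafile.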

----- DEFAULT OR INCLUDED FILE NAMES (see javadoc for details) -----
Files modified during runtime:
datafile.txt : The file of compiled probabilities
corpus_tagset.txt : The list of tags for Viterbi
corpus_simple_tagset.txt : A list mapping tags in the full tagset
	to a smaller list of tags for highlighting 

Included default files:
/corpus : a directory containing the Brown Corpus
/backup : default backup files for use in case of user error.
	Any changes to these files are liable to cause operational problems
	with the program.
webster_dictionary.txt : an English-language dictionary based on the one
	found at http://www.gutenberg.org/files/29765/29765-8.txt
thesaurus.txt : an English-language thesaurus based on the one found at
	http://www.gutenberg.org/cache/epub/10681/pg10681.txt