SenatorialSpeechInvestigation: A Java repository from cestella

If you came here from the CHUG, you might want to see the slide deck here

#Introduction

This is an illustrative project intended to demonstrate an NLP algorithm, simple yet not quite as simple as Word Count, implemented in Hadoop Map Reduce. It is also intended to demonstrate my opinions on unit testing Map Reduce jobs.

#Description

The purpose of the project is to take a dataset of around 1400 speeches from Senators, along with an ideal-point mapping of the politicians onto the number line based on the work of Simon Jackman, and do the following:

1. Find partition points separating Conservative, Moderate and Liberal senators by dividing the probability density into thirds (see histogram).
2. Use this partition to construct three corpora associated with Conservative, Moderate and Liberal senators.
3. Preprocess using a Porter Stemmer and rank the tokenized terms by Inverse Document Frequency.
4. Take the ranked terms and output the terms that are important in the corpus associated with one political orientation but not in the others.
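
The partition step can be sketched in a few lines of plain Java. This is only an illustration, not the code from this repository: the class `IdealPointPartition`, its methods, and the choice of nearest-rank empirical quantiles are assumptions, as is the convention that lower ideal points correspond to Liberal senators.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not taken from the repository.
final class IdealPointPartition {
    enum Orientation { LIBERAL, MODERATE, CONSERVATIVE }

    /** Returns the two cut points that leave roughly a third of the scores in each bucket. */
    static double[] cutPoints(double[] idealPoints) {
        if (idealPoints.length == 0) {
            throw new IllegalArgumentException("need at least one ideal point");
        }
        double[] sorted = idealPoints.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        // Empirical 1/3 and 2/3 quantiles (nearest rank, no interpolation).
        double lower = sorted[(int) Math.ceil(n / 3.0) - 1];
        double upper = sorted[(int) Math.ceil(2 * n / 3.0) - 1];
        return new double[] { lower, upper };
    }

    /** Assumption: lower scores are Liberal, higher scores are Conservative. */
    static Orientation classify(double idealPoint, double[] cuts) {
        if (idealPoint <= cuts[0]) return Orientation.LIBERAL;
        if (idealPoint <= cuts[1]) return Orientation.MODERATE;
        return Orientation.CONSERVATIVE;
    }
}
```

Each speech would then be assigned to the corpus matching the orientation of its senator.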

##Partition

The probability space is split into even thirds, as can be seen here:

##Output

The results, for the curious, are here:

| LIBERAL | MODERATE | CONSERVATIVE |
| --- | --- | --- |
| market | produce | discuss |
| power | commission | past |
| institutions | strengthening | politics |
| initiate | provisions | pointed |
| implement | subcommittee | direct |
| establish | impact | different |
| history | north | account |
| available | managers | failed |
| school | sure | debate |
| according | hard | instead |
| risk | sent | rates |
| measures | defense | reason |
| children | fiscal | congressional |
| expand | honorable | budgeting |
| trained | ability | question |
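
For readers who want to see how columns like these could be produced, here is a minimal in-memory sketch of the two ingredients: inverse document frequency per term, and a filter that keeps terms ranked highly for one corpus but not for the others. It is not the Map Reduce job from this repository. The class `IdfSketch`, both method names, the natural-log formula `ln(N / df)`, and the "top k of one list, absent from the top k of the others" rule are all assumptions made for illustration. The ranked lists are passed in already sorted, so the sketch takes no position on how the real job orders terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the repository.
final class IdfSketch {
    /** idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t. */
    static Map<String, Double> idf(List<Set<String>> documents) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (Set<String> document : documents) {
            for (String term : document) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        int n = documents.size();
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            result.put(entry.getKey(), Math.log((double) n / entry.getValue()));
        }
        return result;
    }

    /** Keeps the first k terms of target that are not among the first k terms of any other list. */
    static List<String> exclusiveTopTerms(List<String> target, List<List<String>> others, int k) {
        Set<String> seenElsewhere = new HashSet<>();
        for (List<String> other : others) {
            seenElsewhere.addAll(other.subList(0, Math.min(k, other.size())));
        }
        List<String> result = new ArrayList<>();
        for (String term : target.subList(0, Math.min(k, target.size()))) {
            if (!seenElsewhere.contains(term)) {
                result.add(term);
            }
        }
        return result;
    }
}
```

Each document is represented as the set of its (already stemmed) terms, since document frequency only cares whether a term occurs at all.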

#Current Caveats

* A naive partition based on an even split of the probability space may be unsuitable. I didn't know of a better, politically agnostic way to do this.
* An analysis based on n-grams, or at least a more intelligent chunking algorithm, would be more appropriate, so that phrases like "north korea" are not split.
* This is not as efficient as it could be. I could have used term IDs instead of the actual terms, but I wanted the Map Reduce job to be as clear as possible.
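
To make the n-gram caveat concrete, here is a small sketch of word bigram generation, which would keep "north korea" together as a single term. It is not part of this repository; the class `BigramSketch` and its method are hypothetical, and the input is assumed to be an already tokenized speech.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from the repository.
final class BigramSketch {
    /** Joins each pair of adjacent tokens with a single space. */
    static List<String> bigrams(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return result;
    }
}
```

For the tokens `sanctions`, `on`, `north`, `korea` this yields `sanctions on`, `on north` and `north korea`, which could be ranked alongside or instead of the single terms.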

This code is mostly an intellectual lark and a demonstration of NLP done using Map Reduce that does not require the baggage of Mahout.

#Usage

##Prerequisites

To execute this, you must have:

* A JDK installed
* Maven 2 or higher installed

To generate the histogram, you also need R. To generate the presentation from the Cleveland CHUG, you need LaTeX.

##To generate the full example

From the command line, in the political-nlp-analysis directory, execute:

 mvn integration-test

You can then use the political-nlp-analysis/generate_top_words.sh script to generate the top words for a particular political orientation thusly:

 generate_top_words.sh {LIBERAL, MODERATE, CONSERVATIVE} <number of terms>

You can reconstruct the density plot using the political-nlp-analysis/src/main/R/generate_histogram.sh script:

 generate_histogram.sh <path to ideal_points.csv>

#Contact

If anyone has comments, concerns or criticisms, please let me hear about them at cestella@gmail.com.
