News-Classifier

"Classify" news articles from various RSS feeds based upon a user's saved free-text queries

Primary Language: Ruby

Installation Instructions:

1. This is a web application developed in Ruby on Rails and running on the Heroku hosting platform. No installation is required.
2. Use the app at http://strong-mist-7332.herokuapp.com
3. I have included only the code related to the NB classifier in the submission file. The full code can be accessed on GitHub at https://github.com/brendte/News-Classifier
4. To use the app, paste the URL of any news article (or other web content) and click Submit. The application will retrieve the main text of the article (using the public Alchemy API at http://www.alchemyapi.com/) and categorize it as “You will Like” or “You will Dislike” (a sketch of this call follows the list).
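For reference, the extraction step amounts to a single HTTP call. The sketch below is illustrative rather than the app’s actual request code: it assumes AlchemyAPI’s URLGetText endpoint, and the helper name and the ALCHEMY_API_KEY environment variable are assumptions.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical helper: fetch the main text of an article via AlchemyAPI's
# URLGetText endpoint. Endpoint and parameter names follow AlchemyAPI's
# public API; the helper name and key handling are assumptions.
def fetch_article_text(article_url)
  uri = URI('http://access.alchemyapi.com/calls/url/URLGetText')
  uri.query = URI.encode_www_form(
    apikey: ENV['ALCHEMY_API_KEY'],
    url: article_url,
    outputMode: 'json'
  )
  response = JSON.parse(Net::HTTP.get(uri))
  response['text'] # the extracted article body, with page boilerplate stripped
end
```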

Code Walkthrough

1. The application uses the Naïve Bayes Classifier algorithm detailed in Mitchell (p. 183) to classify a corpus of news articles into the categories “You will Like” and “You will Dislike” (a compact sketch of the full train-and-classify cycle follows this list).
2. The training data was populated by retrieving ~100 articles in each of ~40 categories (for a total of ~4,000 articles) from the Feedzilla news syndication service using its public REST API (http://code.google.com/p/feedzilla-api/wiki/RestApi). The articles were initially assigned at random to the “Like” or “Dislike” class (seeds.rb, lines 27–72).
3. After all of the articles are seeded, I manually created an instance of the NB class, which contains all of the Naïve Bayes algorithm code (see nb.rb). This instance trains on all of the existing text in the database, stores the calculated values in its attributes, and is then marshaled to the database for long-term storage and later use in classifying new articles.
a. Lines 5–24 initialize the class attributes used to hold the text, values, and probabilities. Particularly important are the hashes @vocabulary and @p.
i. @vocabulary holds every word that appears in the entire set of examples, along with the number of occurrences of the word in each of the two classes (stored internally as true and false). It takes the form {:<word> => { :true => <count>, :false => <count> } }.
ii. @p holds the calculated conditional probability of each word given each class (the P(w_k | v_j) terms in Mitchell’s formula). It takes the form {:<word> => { :true => <probability>, :false => <probability> } }.
b. The slice_n_dice method (lines 27–49) handles normalizing the text, tokenizing it, and counting each occurrence of each word.
c. The calculate_prob method (lines 51–64) uses the data collected by slice_n_dice to calculate all of the necessary probabilities, using the formula in Mitchell (p. 183).
4. When a user enters a URL, the system calls the Alchemy API to extract the text of the article. The text is then normalized and tokenized exactly as the training examples were. The application then classifies the new instance using the formula in Mitchell (p. 183). The values used in the calculation are stored in the previously created NB object (in the @p_true and @p_false attributes and the @p hash), which is de-serialized from the database just in time.
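The whole cycle described above fits in a few dozen lines. The sketch below is a simplified, self-contained rendering of the train/classify algorithm from Mitchell (p. 183), not the code in nb.rb; the class and method names (NBSketch, train!, classify) are illustrative.

```ruby
class NBSketch
  def initialize
    # Per-word occurrence counts per class, e.g. { 'rails' => { true => 3, false => 1 } }
    @vocabulary = Hash.new { |h, k| h[k] = { true => 0, false => 0 } }
    @p = {}                   # per-word conditional probabilities P(word | class)
    @p_true = @p_false = 0.0  # class priors P(Like), P(Dislike)
  end

  # Normalize and tokenize, in the spirit of slice_n_dice:
  # lowercase, strip punctuation, split on whitespace.
  def tokenize(text)
    text.downcase.gsub(/[^a-z0-9\s]/, ' ').split
  end

  # examples: array of [text, liked] pairs, where liked is true or false
  def train!(examples)
    docs  = { true => 0, false => 0 }  # documents per class
    words = { true => 0, false => 0 }  # total word count per class
    examples.each do |text, liked|
      docs[liked] += 1
      tokenize(text).each do |w|
        @vocabulary[w][liked] += 1
        words[liked] += 1
      end
    end
    @p_true  = docs[true].to_f  / examples.size
    @p_false = docs[false].to_f / examples.size
    v = @vocabulary.size
    # Laplace-smoothed estimate from Mitchell: P(w|class) = (n_k + 1) / (n + |Vocabulary|)
    @vocabulary.each do |w, counts|
      @p[w] = {
        true  => (counts[true]  + 1).to_f / (words[true]  + v),
        false => (counts[false] + 1).to_f / (words[false] + v)
      }
    end
  end

  # v_NB = argmax over classes of P(class) * product of P(w | class)
  def classify(text)
    score = { true => @p_true, false => @p_false }
    tokenize(text).each do |w|
      next unless @p.key?(w)  # ignore words never seen in training
      score[true]  *= @p[w][true]
      score[false] *= @p[w][false]
    end
    # A strict comparison means a tie (e.g. both scores underflowing to 0.0)
    # falls through to Dislike, matching the behavior noted under Issues.
    score[true] > score[false] ? 'You will Like' : 'You will Dislike'
  end
end

# Usage:
#   nb = NBSketch.new
#   nb.train!([['ruby rails tutorial', true], ['stock market crash', false]])
#   nb.classify('new rails release')  # => "You will Like"
```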

Issues

1. Currently, the classification is always Dislike, due to a couple of issues:
a. The small data set (and the small number of words per training example) leads to very small conditional probabilities for individual words. The product of many such probabilities underflows to zero for both classes, at which point the category defaults to Dislike (see the illustration below).
b. The random initial classification means that attributes do not cluster sufficiently in either class, which also leads to extremely small probabilities.
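To make issue (a) concrete: multiplying a few hundred word probabilities on the order of 10^-3 underflows a 64-bit float to exactly 0.0. The standard remedy (not implemented in the current code) is to compare sums of log-probabilities rather than products:

```ruby
# Multiplying many small probabilities underflows to zero...
p = 1.0
500.times { p *= 0.001 }
p      # => 0.0 (the true value, 1e-1500, is far below Float::MIN)

# ...whereas summing logs stays comfortably representable.
log_p = 0.0
500.times { log_p += Math.log(0.001) }
log_p  # => approximately -3453.9
```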


Ideas for Enhancement

1. Build a substantially larger corpus of longer articles (the current training corpus consists mostly of very short articles) in order to give the classifier much more data to work with.
2. Add users, so that individual users can build their own sets of Likes and Dislikes and get personalized classification.
3. Remove the randomized initial classification, which is completely arbitrary and leads to poor classification. Instead, have each user initially categorize a small subset of articles, and then encourage them to continue categorizing new articles.
4. Add the concept of user classes, so that Like and Dislike classification for a particular user can take into account the classifications of other users who are “similar” to them (perhaps k-nearest neighbors could be used here).
5. Add the calculation of probabilities for text that includes numeric values (using the method posted by the instructor; I planned to do this but ran out of time).