This is an implementation of sentiment analysis for Rotten Tomatoes movie reviews using MapReduce.

------------------ Input --------------------
Place your input files in txt format in the /data directory.

------------------ HADOOP DIRECTORY SETUP ----
/user/cloudera/QueryIndex

Make sure this path is set up on HDFS before executing these files, because the Python script copies the input files present in the ~/data directory to HDFS.

The QueryIndex directory on HDFS has two subdirectories, input and output, i.e.:

    /user/cloudera/QueryIndex/input
    /user/cloudera/QueryIndex/output

The file in the config directory takes care of placing files into it when you run it.

----------------------------------------------
Information about the file name convention used:

-------------------
To ease the process of running all the files and arranging input and output, I have implemented this Python script. It also takes care of setting up files on HDFS: it moves the files present in the input directory to /user/cloudera/QueryIndex/input on HDFS and then executes the MainDriver Java class.

Usage: python

--------------------
If you wish to execute all the files manually:

    sudo javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d Index_classes
    jar -cvf MainDriver.jar -C Index_classes/ .
    sudo -u hdfs hadoop jar MainDriver.jar org.myorg.MainDriver /user/cloudera/QueryIndex/input /user/cloudera/QueryIndex/output

-------------------
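
The setup-and-launch flow described above (stage input files on HDFS, then run MainDriver) could be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual script: the function names build_commands and run are hypothetical, the local file names are placeholders, and the -p flag on "hadoop fs -mkdir" assumes a Hadoop 2.x-style shell. The HDFS paths, jar name, and driver class are the ones given in this README.

```python
import subprocess

HDFS_ROOT = "/user/cloudera/QueryIndex"   # path named in this README
HDFS_INPUT = HDFS_ROOT + "/input"
HDFS_OUTPUT = HDFS_ROOT + "/output"


def build_commands(local_files, jar="MainDriver.jar",
                   main_class="org.myorg.MainDriver"):
    """Return the commands for one run as argument lists (hypothetical helper)."""
    as_hdfs = ["sudo", "-u", "hdfs", "hadoop"]
    commands = [
        # Create the HDFS input directory and its parents if they are missing.
        # Assumption: the shell accepts -p (Hadoop 2.x-style).
        as_hdfs + ["fs", "-mkdir", "-p", HDFS_INPUT],
    ]
    if local_files:
        # Copy every local input file into the HDFS input directory.
        commands.append(as_hdfs + ["fs", "-put"] + list(local_files) + [HDFS_INPUT])
    # Launch the MapReduce driver on the input and output paths.
    commands.append(as_hdfs + ["jar", jar, main_class, HDFS_INPUT, HDFS_OUTPUT])
    return commands


def run(local_files, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(local_files):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)
```

Calling run(["data/reviews.txt"]) with the default dry_run=True only prints the commands, which makes it easy to check them before touching the cluster; reviews.txt is a placeholder file name.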
Classify the sentiment of sentences from the Rotten Tomatoes dataset

"There's a thin line between likably old-fashioned and fuddy-duddy, and The Count of Monte Cristo ... never quite settles on either side."

The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee. In their work on sentiment treebanks, Socher et al. used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This project presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset.

We have to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.
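
For benchmarking, the five-value scale above can be represented as an ordered list, with predictions scored against reference labels. The sketch below is illustrative only: the integer codes 0 to 4 and the helper names label_to_id and accuracy are assumptions made for this example, not something the project code is known to use.

```python
# The five sentiment classes named above, ordered from most negative to most positive.
LABELS = ["negative", "somewhat negative", "neutral", "somewhat positive", "positive"]


def label_to_id(label):
    """Map a label name to its position on the scale (assumed codes 0..4)."""
    return LABELS.index(label)


def accuracy(predicted, gold):
    """Fraction of phrases whose predicted label matches the reference label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and of equal length")
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / float(len(gold))
```

Exact-match accuracy treats every mistake alike, so predicting "positive" for a "somewhat positive" phrase costs as much as predicting "negative"; an ordinal measure may be preferable if near misses should count for less.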