Sentiment Analysis

Computing arbitrary functions on large real-time data sets is a difficult task. This study shows one possible approach to problems of this type, using sentiment analysis of tweets as the example task. Instead of fetching tweets through the Twitter API, we simulate their arrival by reading them from the Sentiment140 dataset. In our scenario we build a Lambda Architecture that uses Apache Hadoop for the Batch Layer, Apache Storm for the Speed Layer, and HDFS for data management. We use LingPipe to classify tweets using computational linguistics. This architecture makes it possible to harness the full power of a computer cluster for data processing, scales easily, and meets the low-latency requirements of answering queries in real time.
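To make the query side of a Lambda Architecture concrete, here is a minimal sketch of how a query could combine a batch view with a speed view. It assumes both views are simple per-sentiment counters; the class name, method name, and sample counts are hypothetical and are not taken from this repository.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: names and numbers are illustrative, not the repository's code.
class LambdaQuerySketch {

    // Answer a query by adding the counts precomputed by the batch layer
    // to the counts the speed layer has accumulated since the last batch run.
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> speedView) {
        // TreeMap keeps keys sorted, so the printed result is deterministic.
        Map<String, Long> result = new TreeMap<>(batchView);
        for (Map.Entry<String, Long> entry : speedView.entrySet()) {
            result.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts, only to show the merge.
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("positive", 1200L);
        batchView.put("negative", 900L);

        Map<String, Long> speedView = new HashMap<>();
        speedView.put("positive", 15L);
        speedView.put("negative", 7L);

        System.out.println(merge(batchView, speedView)); // prints {negative=907, positive=1215}
    }
}
```

The batch view is complete but stale, and the speed view is fresh but partial, so summing the two at query time gives an up-to-date answer without waiting for the next batch run.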

Software requirements

Java JDK: OpenJDK 11
Apache Hadoop 2.9.2
Apache Storm 2.1.0
HdfsSpout Apache Storm 2.1.0
LingPipe 4.1.2
jfreechart 1.0.1

Dataset

Sentiment140, 1.6 million tweets with annotated sentiment: download
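As an illustration of what one record of the dataset looks like to a program, the sketch below parses a single line. It assumes the usual Sentiment140 CSV layout: six double-quoted fields (polarity, id, date, query, user, text), with polarity 0 for negative, 2 for neutral, and 4 for positive. The class name and the sample line are hypothetical, and the splitting is deliberately naive rather than a full CSV parser.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: assumes six double-quoted, comma-separated fields per line.
class Sentiment140LineSketch {

    // Minimal holder for the two fields a classifier needs.
    static final class Tweet {
        final int polarity;
        final String text;

        Tweet(int polarity, String text) {
            this.polarity = polarity;
            this.text = text;
        }
    }

    static Tweet parse(String line) {
        String trimmed = line.trim();
        if (trimmed.length() < 2 || !trimmed.startsWith("\"") || !trimmed.endsWith("\"")) {
            throw new IllegalArgumentException("Expected a fully quoted CSV line");
        }
        // Drop the outer quotes, then split on the "," separator between fields.
        // The limit of 6 keeps any separator-like sequence inside the tweet text intact.
        String inner = trimmed.substring(1, trimmed.length() - 1);
        String[] fields = inner.split(Pattern.quote("\",\""), 6);
        if (fields.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields but found " + fields.length);
        }
        return new Tweet(Integer.parseInt(fields[0]), fields[5]);
    }

    static String label(int polarity) {
        switch (polarity) {
            case 0: return "negative";
            case 2: return "neutral";
            case 4: return "positive";
            default: throw new IllegalArgumentException("Unknown polarity: " + polarity);
        }
    }

    public static void main(String[] args) {
        // Made-up sample line in the assumed layout, not a real record from the dataset.
        String line = "\"4\",\"1\",\"Mon Apr 06 22:19:45 PDT 2009\",\"NO_QUERY\",\"some_user\",\"good morning, world\"";
        Tweet tweet = parse(line);
        System.out.println(label(tweet.polarity) + ": " + tweet.text); // prints positive: good morning, world
    }
}
```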

Usage

Download this repo

git clone https://github.com/fedem96/SentimentAnalysis-LambdaArchitecture.git

Run the whole process

Import all the JARs/libraries into your project and test your Hadoop configuration. To set up a single-node cluster, see the Hadoop guide.

Classifier: edit the run configuration and set args[0]=dataset_file and args[1]=file in which to save the classifier
Generator: edit the run configuration and set args[0]=dataset_file to generate tweets on HDFS
Batch Layer: no arguments needed. Run after Generator
Speed Layer: no arguments needed. Run after Generator
Query Gui: no arguments needed. Run after the Batch and Speed layers
Clear: no arguments needed. Clears all the directories in HDFS before running a new simulation
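For the components that do take arguments, the positional convention above can be checked before any work starts. The following is a minimal sketch for the two Classifier arguments; the class name, method name, and usage message are hypothetical and are not the repository's actual entry point.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of argument handling for the Classifier step.
class ClassifierArgsSketch {

    // Returns {datasetFile, classifierOutputFile}, or fails fast with a usage message.
    static Path[] parseArgs(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "usage: <dataset_file> <classifier_output_file>");
        }
        Path datasetFile = Paths.get(args[0]);      // args[0]: the dataset file
        Path classifierFile = Paths.get(args[1]);   // args[1]: where to save the classifier
        return new Path[] {datasetFile, classifierFile};
    }

    public static void main(String[] args) {
        Path[] paths = parseArgs(args);
        System.out.println("Dataset: " + paths[0]);
        System.out.println("Classifier output: " + paths[1]);
    }
}
```

Failing early with a usage message avoids starting a long training run only to discover that the output path was never set.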