Basic sentiment analysis of real-time tweets using Apache Kafka, a queuing service for data streams.
Initialization Steps:
Download and extract the Twitter data zip file:
Download the 16M.txt.zip file from here: https://drive.google.com/file/d/1K1ub__1yKOMTSNAp7f_NGiCefT6KjxXM/view
unzip 16M.txt.zip
Start the ZooKeeper service:
$KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
Start the Kafka service:
$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties
Create a topic named twitterstream in Kafka:
$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic twitterstream
Check what topics you have with:
$KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper localhost:2181
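As an alternative check, you can also list the topics from Python. This short sketch assumes the kafka-python package is installed and that the broker is listening on Kafka's default port 9092:

from kafka import KafkaConsumer

# Connect to the local broker and print the topics it knows about;
# 'twitterstream' should appear in the output.
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
print(consumer.topics())
consumer.close()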
Using the Streaming API:
To stream the tweets and push them to the Kafka queue, we have provided a Python script, twitter_to_kafka.py. It reads tweets from the downloaded file and pushes them to the twitterstream topic in Kafka. Run it as follows:
$ python twitter_to_kafka.py
Note: this program must be running while you work on your portion of the assignment; otherwise you will not get any tweets.
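For reference, a minimal producer along these lines might look like the sketch below. It assumes the kafka-python package and the extracted 16M.txt file with one tweet per line; the provided twitter_to_kafka.py may differ in details such as rate limiting or encoding.

from kafka import KafkaProducer

# Push one tweet per line from the extracted file into the 'twitterstream' topic.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
with open('16M.txt', 'r') as f:
    for line in f:
        tweet = line.strip()
        if tweet:
            producer.send('twitterstream', tweet.encode('utf-8'))
producer.flush()   # make sure all messages are delivered before exiting
producer.close()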
To check if the data is landing in Kafka:
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic twitterstream --from-beginning
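If you prefer to check from Python instead of the console consumer, a short kafka-python consumer (an assumption; any Kafka client works) can read a few messages from the beginning of the topic:

from kafka import KafkaConsumer

# Read from the start of 'twitterstream', print up to 10 messages,
# and stop after 5 seconds with no new data (consumer_timeout_ms).
consumer = KafkaConsumer('twitterstream',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for i, msg in enumerate(consumer):
    print(msg.value.decode('utf-8', errors='replace'))
    if i >= 9:
        break
consumer.close()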
Running the Stream Analysis Program (after finishing the project requirements):
$SPARK_HOME/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 twitterStream.py
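To illustrate the overall shape of such a program, here is a minimal sketch of a Spark Streaming job that consumes the twitterstream topic and counts occurrences of a few placeholder positive and negative words. The word lists, batch interval, and output are assumptions for illustration only; your twitterStream.py must implement the actual project requirements.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Placeholder word lists for illustration; replace with the lists the assignment specifies.
POSITIVE = {'good', 'great', 'love'}
NEGATIVE = {'bad', 'sad', 'hate'}

def classify(word):
    if word in POSITIVE:
        return [('positive', 1)]
    if word in NEGATIVE:
        return [('negative', 1)]
    return []

sc = SparkContext(appName='TwitterSentiment')
ssc = StreamingContext(sc, 10)  # 10-second batches (an assumption)

# Read (key, message) pairs from the twitterstream topic via the ZooKeeper quorum.
kstream = KafkaUtils.createStream(ssc, 'localhost:2181', 'twitter-sentiment-group', {'twitterstream': 1})
tweets = kstream.map(lambda kv: kv[1])

# Count positive/negative words in each batch and print the running batch counts.
counts = tweets.flatMap(lambda tweet: tweet.lower().split()) \
               .flatMap(classify) \
               .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()

A sketch like this would be launched with the same spark-submit command shown above, since the spark-streaming-kafka-0-8 package supplies the KafkaUtils receiver.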