The project aims at building a data platform for real time moderation and analytics of twitter data. The implementation will utilize different big data technologies as Spark, Kafka and Hive, in addition to PowerBI visualization tool for data discovery and delivering insights.
- Create a Twitter Developer Account in order to access Twitter API (GET, POST) and get your credentials. (Link > https://developer.twitter.com/en )
- I was workong on Hortonworks HDP 2.6.5 virtual machine and set up the required envronmet # Please check Environment Setup.txt
- Using Mobaxterm SSH Session:
- Deal with Kafka:
cd /usr/hdp/current/kafka-broker bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic
bin/kafka-topics.sh --list --zookeeper localhost:2181
bin/kafka-console-producer.sh --broker-list 172.18.0.2:6667 --topic
bin/kafka-console-consumer.sh --bootstrap-server 172.18.0.2:6667 --topic --from-beginning
ntpdate -u time.google.com
python Ingesting_Real_Time_Tweets_Using_Tweepy_Kafka_Python.py
**Untill now We are producing tweets from Twitter, Save it to a Kafka topic, waiting to be consumed by our next componet >> Spark and Spark Streaming
- Deal with Spark
Run our processing component to do the following >> Sentiment Analysis Using TextBlob lib in python, Reply to tweets based on sentiment and finally store the output on HDFS parquet format
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.6,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6 --master local[*] /root/Spark_Streaming_Processing.py
- Deal with Parquet
Spark is responsible of dealing with this part, check the previous script.
- Deal with Hive
hive create database bigdata; use bigdata; create external table bigdata.tweets(user_id string,tweet_timestamp string,followers_count string,location string, text string, retweet_count string,tweet_id string,user_name string, polarity string, subjectivity string,sentiment string) stored as parquet location 'hdfs://sandbox-hdp.hortonworks.com:8020/casestudy';
select * from tweettable;
- Deal with ODBC and PowerBI
Download and Setup ClouderaHiveODBC64 https://drive.google.com/file/d/16FJcIBEM-lw_XmlZpU3KpQcfw88JIBWI/view?usp=sharing
After Installing 1- open ODBC data sources 2- choose system DSN 3- choose button Add 4- Data Source Name: set any_name 5- Host: localhost 6- Database: bigdata "same database name in hive" 7- Hive Server Type: choose Hive Server 2 8- Test and Ok 9- Connection Successfuly
Download PowerBI Desktop https://www.microsoft.com/en-us/download/details.aspx?id=58494
After Installing 1- open PowerBI Desktop 2- choose Get Data 3- choose ODBC ,then connect 4- choose Data Source Name "same name you set it in system DSN" ,then OK 5- choose Hive --> bigdata ---> tweets ---> Load
6- Start building your dashboard