Twitter Streaming with Apache Spark, Apache Kafka, Hive, HBase, Spark SQL, and Tableau.
-
Getting Twitter API keys
- Create a Twitter account if you do not already have one.
- Go to https://apps.twitter.com/ and log in with your Twitter credentials.
- Click "Create New App".
- Fill out the form, agree to the terms, and click "Create your Twitter application".
- On the next page, click the "API keys" tab and copy your "API key" and "API secret".
- Scroll down, click "Create my access token", and copy your "Access token" and "Access token secret".
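The four credentials are best kept out of the source code, e.g. in a Java properties file the producer loads at startup. A minimal sketch of that approach (the file name twitter.properties and the key names below are assumptions, not part of the project):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Properties;

public class TwitterCredentials {
    // Loads the four Twitter keys from a properties file so they are not hard-coded.
    public static Properties load(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        // Fail fast if any of the four credentials is missing.
        for (String key : new String[] {
                "twitter.apiKey", "twitter.apiSecret",
                "twitter.accessToken", "twitter.accessTokenSecret"}) {
            if (props.getProperty(key) == null) {
                throw new IllegalStateException("Missing credential: " + key);
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Write a sample file just to demonstrate the expected layout;
        // in practice you paste the values copied from apps.twitter.com.
        File f = new File("twitter.properties");
        try (PrintWriter out = new PrintWriter(f)) {
            out.println("twitter.apiKey=YOUR_API_KEY");
            out.println("twitter.apiSecret=YOUR_API_SECRET");
            out.println("twitter.accessToken=YOUR_ACCESS_TOKEN");
            out.println("twitter.accessTokenSecret=YOUR_ACCESS_TOKEN_SECRET");
        }
        Properties props = load("twitter.properties");
        System.out.println(props.getProperty("twitter.apiKey"));
    }
}
```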
-
Open a terminal and start the Kafka server:
cd /opt/kafka_2.13-2.6.2/
bin/kafka-server-start.sh config/server.properties
-
Run TweetProducer.java from the twitter-kafka project.
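TweetProducer is not reproduced here; its core loop presumably reads statuses from the Twitter stream and publishes each one to a Kafka topic. A sketch of that flow, with the Kafka client behind a tiny interface so the logic runs without a broker (the topic name "tweets" is an assumption; the real class would use org.apache.kafka.clients.producer.KafkaProducer):

```java
import java.util.ArrayList;
import java.util.List;

public class TweetProducerSketch {
    // Stand-in for the Kafka send call; the real producer would wrap
    // org.apache.kafka.clients.producer.KafkaProducer<String, String>.
    interface Sink {
        void send(String topic, String value);
    }

    // Publishes each incoming status to the given Kafka topic.
    static void publish(Iterable<String> statuses, Sink sink, String topic) {
        for (String status : statuses) {
            sink.send(topic, status);
        }
    }

    public static void main(String[] args) {
        // Collect into a list instead of a broker, just to show the flow.
        List<String> sent = new ArrayList<>();
        Sink collector = (topic, value) -> sent.add(topic + " <- " + value);
        publish(List.of("hello from @alice", "hello from @bob"), collector, "tweets");
        sent.forEach(System.out::println);
    }
}
```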
-
Run JavaSparkApp.java from the spark-streaming project.
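JavaSparkApp presumably writes the consumed tweets to HDFS under /user/cloudera/Tweets as colon-separated name:country:followers records, which is what the Hive table below is declared to parse ('separatorChar' = ':'). A sketch of that record formatting (the field names come from the Hive table; the escaping rule is an assumption):

```java
public class TweetRecord {
    // Formats one tweet as the colon-separated line the Hive OpenCSVSerde
    // table expects (separatorChar ':').
    static String toLine(String name, String country, long followers) {
        return clean(name) + ":" + clean(country) + ":" + followers;
    }

    // Strip the separator from free-text fields so the column count stays stable.
    static String clean(String field) {
        return field == null ? "" : field.replace(":", " ");
    }

    public static void main(String[] args) {
        System.out.println(toLine("alice", "Italy", 1200));   // alice:Italy:1200
        System.out.println(toLine("bob:the:builder", null, 7)); // bob the builder::7
    }
}
```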
-
Start Hive:
hive
and create a new external table:
CREATE EXTERNAL TABLE tweet_data_table (name STRING, country STRING, followers STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ':', 'quoteChar' = '\\')
LOCATION '/user/cloudera/Tweets'
TBLPROPERTIES ('hive.input.dir.recursive'='true', 'hive.mapred.supports.subdirectories'='true', 'mapreduce.input.fileinputformat.input.dir.recursive'='true');
-
Create an internal table:
CREATE TABLE report (name STRING, followers STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ":";
-
Load data into the internal table:
LOAD DATA LOCAL INPATH '/home/cloudera/Tweet' OVERWRITE INTO TABLE report;
- Sometimes the HBase services die and must be restarted with:
sudo service hbase-master restart; sudo service hbase-regionserver restart;
- Cloudera: https://www.cloudera.com/
- Apache Spark Streaming: https://spark.apache.org/streaming/
- Apache Spark SQL: http://spark.apache.org/sql/
- Apache Kafka: https://kafka.apache.org/
- Apache Hive: https://hive.apache.org/
- Apache HBase: https://hbase.apache.org/
- Tableau: https://www.tableau.com/