This demo was part of the technical webinar workshops below:
- "Real Time Monitoring with Hadoop" - Slides and webinar recording are available here
- "Search Workshop" - Slides and webinar recording are available here
Author: Ali Bajwa
With special thanks to:
- Guilherme Braccialli for helping maintain the code and adding the sentiment analysis component
- Tim Veil for developing the original banana dashboard
Purpose: Monitor Twitter stream for S&P 500 companies to identify & act on unexpected increases in tweet volume
-
Ingest: Listen for Twitter streams related to S&P 500 companies
-
Processing:
- Monitor tweets for unexpected volume
- Volume thresholds managed in HBase
-
Persistence:
- HDFS (for future batch processing)
- Hive (for interactive query)
- HBase (for realtime alerts)
- Solr/Banana (for search and reports/dashboards)
- Audits in Ranger/Solr/Banana
- Authorization policies in Ranger
-
Refine:
- Update threshold values based on historical analysis of tweet volumes
-
Demo setup:
- Either download and start the prebuilt VM, or
- Start the HDP 2.3.2 sandbox and run the provided scripts to set up the demo
-
Previous versions
- Option 1: Setup demo using prebuilt VM based on HDP 2.3 sandbox
- Option 2: Setup demo via scripts on vanilla HDP 2.3.2 sandbox
- Kafka basics - optional
- Setup Eclipse
- Run demo to monitor Tweets about S&P 500 securities in realtime
- Stop demo
- Troubleshooting
- Observe results in HDFS, Hive, Solr/Banana, HBase
- Use Zeppelin to create charts to analyze tweets - optional
- Import data into BI tools - optional
- Other things to try - optional
- Reset demo
- Run demo on cluster
- Download the VM from here. Import it into VMware Fusion and start it up.
- Find the IP address of the VM and add an entry to your machine's hosts file, e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
- Connect to the VM via SSH (password hadoop)
ssh root@sandbox.hortonworks.com
- Start the demo by running:
cd /root/hdp22-twitter-demo
./start-demo.sh
#once the Storm topology is submitted, press Control-C
#start the Kafka Twitter producer
./kafkaproducer/runkafkaproducer.sh
-
Observe results in HDFS, Hive, Solr/Banana, HBase
-
Troubleshooting: check the Storm webUI for any errors and try resetting using the script below:
./reset-demo.sh
These setup steps are only needed the first time and may take up to 30 minutes to run (depending on your internet connection)
-
While waiting on any step, if you don't already have Twitter credentials, follow the steps here to get them
-
Download the HDP 2.3.2 sandbox VM image file (Sandbox_HDP_2.3.2_VMWare.ova) from the Hortonworks website
-
Import the ova into VMware Fusion, allocate at least 4 CPUs and 8 GB RAM (preferably 9.6 GB+ RAM), and start the VM
-
Find the IP address of the VM and add an entry to your machine's hosts file, e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
- Connect to the VM via SSH (password hadoop). You can also SSH via browser by clicking: http://sandbox.hortonworks.com:4200
ssh root@sandbox.hortonworks.com
- Download code as root user
cd
git clone https://github.com/hortonworks-gallery/hdp22-twitter-demo.git
- Download Ambari service for VNC (details below)
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/hortonworks-gallery/ambari-vnc-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/VNCSERVER
service ambari restart
- Setup demo: run the script below to set up the demo (one time); it will start Ambari/HBase/Kafka/Storm and install Maven, Solr, and Banana.
cd /root/hdp22-twitter-demo
./setup-demo.sh
- While it runs, proceed with installing the VNC service per the steps below
-
Once the status of HDFS/YARN has changed from a yellow question mark to a green check mark...
-
Set up Eclipse on the sandbox VM and remote desktop into it using an Ambari service for VNC
- In Ambari, open the Admin > Stacks and Services tab.
- You can access this via http://sandbox.hortonworks.com:8080/#/main/admin/stack/services
- Deploy the service by selecting:
- VNC Server -> Add service -> Next -> Next -> Enter password (e.g. hadoop) -> Next -> Proceed Anyway -> Deploy
- Make sure the password is at least 6 characters or the install will fail
- Connect to VNC from your local laptop using VNC viewer software (e.g. TightVNC Viewer, Chicken of the VNC, or just your browser). Detailed steps here
- Import code into Eclipse using "Getting started with Storm and Maven in Eclipse environment" steps here
- Review the Storm code in Eclipse under /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm (a rough sketch of how these classes are wired together follows the list):
- GNstorm.java: main class, where the topology, KafkaSpout, and HDFS bolts are instantiated
- TwitterScheme.java: defines the structure of a tweet
- SolrBolt.java: writes to Solr
- TwitterRuleBolt.java: defines the business logic for when a tweet should result in an alert
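- A minimal sketch of the wiring described above (simplified, not the demo's exact code; the topic name, spout/bolt ids, parallelism, and groupings are assumptions):
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class TopologySketch {
  public static void main(String[] args) throws Exception {
    // Kafka spout reads raw tweets from the Kafka topic via ZooKeeper
    BrokerHosts zk = new ZkHosts("sandbox.hortonworks.com:2181");
    SpoutConfig spoutConf = new SpoutConfig(zk, "twitterdemo", "/kafkaspout", "twitterdemo-id"); // topic/ids assumed
    spoutConf.scheme = new SchemeAsMultiScheme(new TwitterScheme()); // TwitterScheme parses the tweet fields

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafkaspout", new KafkaSpout(spoutConf), 1);
    // TwitterRuleBolt compares tweet volume against the HBase thresholds and emits alerts
    builder.setBolt("rulebolt", new TwitterRuleBolt(), 1).shuffleGrouping("kafkaspout");
    // SolrBolt indexes tweets/alerts for Banana; the HDFS/Hive bolts attach in the same way
    builder.setBolt("solrbolt", new SolrBolt(), 1).shuffleGrouping("rulebolt");

    StormSubmitter.submitTopology("Twittertopology", new Config(), builder.createTopology());
  }
}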
- Twitter4J requires you to have a Twitter account and obtain developer keys by registering an "app". Create a Twitter account and app, then get your consumer key/secret and access token/secret: https://apps.twitter.com > sign in > create new app > fill anything > create access tokens
- Then enter the four values into the file below on the sandbox (an illustration of how these credentials are used programmatically follows the file):
vi /root/hdp22-twitter-demo/kafkaproducer/twitter4j.properties
oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=
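- Twitter4J normally picks these up automatically from a twitter4j.properties file on the classpath/working directory; purely as an illustration (not the demo's producer code), the same credentials can also be supplied programmatically:
import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class StreamSketch {
  public static void main(String[] args) {
    ConfigurationBuilder cb = new ConfigurationBuilder()
        .setOAuthConsumerKey("...")          // oauth.consumerKey
        .setOAuthConsumerSecret("...")       // oauth.consumerSecret
        .setOAuthAccessToken("...")          // oauth.accessToken
        .setOAuthAccessTokenSecret("...");   // oauth.accessTokenSecret
    TwitterStream stream = new TwitterStreamFactory(cb.build()).getInstance();
    stream.addListener(new StatusAdapter() {
      @Override public void onStatus(Status status) {
        System.out.println(status.getUser().getScreenName() + ": " + status.getText());
      }
    });
    stream.filter(new FilterQuery().track(new String[] { "AAPL", "MSFT" })); // track a couple of symbols as a test
  }
}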
- Once Kafka is started, run through the commands below to understand the basics (a minimal Java producer sketch follows the commands)
#check if kafka already started
ps -ef | grep kafka
#if not, start kafka
nohup /usr/hdp/current/kafka-broker/bin/kafka-server-start.sh /usr/hdp/current/kafka-broker/config/server.properties &
#create topic
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --zookeeper $(hostname -f):2181 --replication-factor 1 --partitions 1 --topic test
#list topic
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $(hostname -f):2181 --list | grep test
#start a producer and enter text on few lines
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list $(hostname -f):6667 --topic test
#start a consumer in a new terminal; your text should appear in the consumer
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --zookeeper $(hostname -f):2181 --topic test --from-beginning
#hit Control-C on both terminals to quit the consumer/producer
#delete topic (only works if delete.topic.enable and auto.create.topics.enable are set to true in Ambari > Kafka > Config)
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --delete --zookeeper $(hostname -f):2181 --topic test
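- For reference, a minimal Java equivalent of the console producer above, written against the standard Kafka producer API (broker host/port and topic name taken from the commands above; this is a sketch, not the demo's TestProducer):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "sandbox.hortonworks.com:6667"); // HDP Kafka broker port
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
    // same effect as typing a line into kafka-console-producer.sh
    producer.send(new ProducerRecord<String, String>("test", "hello from java"));
    producer.close();
  }
}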
-
The sandbox comes with Ranger installed. You can use the steps below to set up HBase/Hive audits to Solr and set up a Silk (Banana) dashboard to visualize them
-
Set up Solr, Banana, and the 'Ranger Audits' dashboard using HDP Search (Solr 5.2) - note this will install a view and restart Ambari
cd
wget https://github.com/abajwa-hw/security-workshops/raw/master/scripts/setup_solr_banana.sh
chmod +x setup_solr_banana.sh
# assuming you already created a /etc/hosts entry for sandbox.hortonworks.com on your local laptop, just run below
./setup_solr_banana.sh
# otherwise, pass in appropriate argument as described below
./setup_solr_banana.sh <arguments>
#on sandbox
service ambari start
- argument options:
- if no arguments are passed, the FQDN will be used as the hostname to set up the dashboard/view (use this if you have created a local hosts entry for the host where Solr will run, e.g. sandbox.hortonworks.com)
- if "publicip" is passed, the public IP address will be used as the hostname to set up the dashboard/view (use this in cloud environments)
- otherwise, the passed-in value will be used as the hostname to set up the dashboard/view
- The Solr UI should be available at http://(your hostname):6083/solr/#/ranger_audits e.g. http://sandbox.hortonworks.com:6083/solr/#/ranger_audits
- An empty Banana dashboard should be available at http://(your hostname):6083/banana e.g. http://sandbox.hortonworks.com:6083/banana.
- To manually start this Ranger Solr instance (e.g. after a reboot), use the command below:
/opt/lucidworks-hdpsearch/solr/ranger_audit_server/scripts/start_solr.sh
- Setup HBase Ranger plugin to audit to Solr
cd /usr/hdp/2.*/ranger-hbase-plugin/
vi /usr/hdp/2.*/ranger-hbase-plugin/install.properties
XAAUDIT.SOLR.IS_ENABLED=true
XAAUDIT.SOLR.SOLR_URL=http://sandbox.hortonworks.com:6083/solr/ranger_audits
./enable-hbase-plugin.sh
- Setup Hive Ranger plugin to audit to Solr
cd /usr/hdp/2.*/ranger-hive-plugin/
vi /usr/hdp/2.*/ranger-hive-plugin/install.properties
XAAUDIT.SOLR.IS_ENABLED=true
XAAUDIT.SOLR.SOLR_URL=http://sandbox.hortonworks.com:6083/solr/ranger_audits
./enable-hive-plugin.sh
- Now restart HBase and Hive to register the plugins.
Most of the steps below are optional since they were already executed by the setup script above, but they are useful for understanding the components of the demo:
-
Review the list of stock symbols whose Twitter mentions we will be tracking: http://en.wikipedia.org/wiki/List_of_S%26P_500_companies
-
Generate the securities CSV from the above page and review the generated securities.csv. The last field is the generated tweet volume threshold (a small sketch of the row format follows the commands below)
/root/hdp22-twitter-demo/fetchSecuritiesList/rungeneratecsv.sh
cat /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
- (Optional) For future runs, you can add other stocks/hashtags to monitor to the CSV (make sure there are no trailing spaces/newlines at the end of the file). Find these at http://mobile.twitter.com/trends
sed -i '1i$HDP,Hortonworks,Technology,Technology,Santa Clara CA,0000000001,5' /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
sed -i '1i#hadoopsummit,Hadoop Summit,Hadoop,Hadoop,Santa Clara CA,0000000001,5' /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
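- As a rough illustration of the row format (not the demo's parsing code), each line is a comma-separated record whose last field is the tweet volume threshold:
public class SecuritiesRowSketch {
  public static void main(String[] args) {
    // example row in the same shape as the sed-inserted lines above
    String row = "$HDP,Hortonworks,Technology,Technology,Santa Clara CA,0000000001,5";
    String[] fields = row.split(",");
    String symbol = fields[0];                                   // stock symbol or hashtag to track
    int threshold = Integer.parseInt(fields[fields.length - 1]); // tweet volume alert threshold
    System.out.println(symbol + " alerts when tweet volume exceeds " + threshold);
  }
}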
- Open a connection to HBase via Phoenix and check that you can list tables. Notice the securities data was imported and the alerts table is empty (a JDBC equivalent of these queries follows the commands)
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure
!tables
select * from securities;
select * from alerts;
select * from dictionary;
!q
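- The same queries can also be run over JDBC using the Phoenix driver (the connect string matches the one used by the demo's bolts; the generic row printing below is just a sketch):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class PhoenixQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181:/hbase-unsecure");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("select * from securities");
    ResultSetMetaData md = rs.getMetaData();
    while (rs.next()) {
      StringBuilder line = new StringBuilder();
      for (int i = 1; i <= md.getColumnCount(); i++) {
        line.append(md.getColumnName(i)).append('=').append(rs.getString(i)).append("  ");
      }
      System.out.println(line);
    }
    conn.close();
  }
}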
- Check the Hive table schema where we will store the tweets for later analysis
hive -e 'desc tweets_text_partition'
- Start the Storm Twitter topology to generate alerts into an HBase table for stocks whose tweet volume is higher than the threshold; this will also write tweets to Hive/HDFS/local disk/Solr/Banana. The first time you run the command below, Maven will take ~15 minutes to download dependent jars
cd /root/hdp22-twitter-demo
./start-demo.sh
#once the Storm topology is submitted, press Control-C
- (Optional) Other modes in which the topology can be started in future runs, if you want to clean the setup or run locally (not on the Storm instance running on the sandbox)
cd /root/hdp22-twitter-demo/stormtwitter-mvn
./runtopology.sh runOnCluster clean
./runtopology.sh runLocally skipclean
- If you see errors like the ones below, double-check in Ambari that Storm is still up
Caused by: java.net.ConnectException: Connection refused
Could not find leader nimbus from seed hosts [sandbox.hortonworks.com]. Did you specify a valid list of nimbus hosts for config nimbus.seeds
-
Open the Storm UI and confirm the topology was created, using either:
-
Start the Kafka producer: in a new terminal, compile and run the Kafka producer to start producing tweets containing the first 400 stock symbols from the CSV
/root/hdp22-twitter-demo/kafkaproducer/runkafkaproducer.sh
-
To stop producing tweets, press Control-C in the terminal you ran runkafkaproducer.sh
-
Kill the Storm topology to stop processing tweets
storm kill Twittertopology
-
If Storm webUI shows topology errors...
-
Check the Storm webUI for any errors and try resetting using the script below:
./reset-demo.sh
- (Optional): In case of Ranger authorization errors, add users to global allow policies
- Start Ranger and login to http://sandbox.hortonworks.com:6080 (admin/admin)
service ranger-admin start
- "HDFS Global Allow": add group root to this policy - by opening http://sandbox.hortonworks.com:6080/#!/hdfs/1/policy/2
- "HBase Global Allow": add group hadoop to this policy - by opening http://sandbox.hortonworks.com:6080/#!/hbase/3/policy/8
- "Hive Global Tables Allow": add user admin to this policy - by opening http://sandbox.hortonworks.com:6080/#!/hive/2/policy/5
- Note you will need to first create an admin user - by opening http://sandbox.hortonworks.com:6080/#!/users/usertab
-
Open the new Storm View and check the statistics for each Bolt: 'Acked' columns should start increasing.
-
The statistics are also available via the Storm UI: http://sandbox.hortonworks.com:8744/
-
Open Files view and see the tweets getting stored: http://sandbox.hortonworks.com:8080/#/main/views/FILES/1.0.0/Files
- Open Hive table via Hive view. Notice tweets appear in the Hive table that was created: http://sandbox.hortonworks.com:8080/#/main/views/HIVE/1.0.0/Hive
-
Open Banana UI and view/search tweet summary and alerts: http://sandbox.hortonworks.com:8983/solr/banana/index.html
- You can also access the UI via Ambari view by following steps here and replacing the url with http://sandbox.hortonworks.com:8983/solr/banana/index.html
- For more details on how the Banana dashboard panels are built, refer to the underlying JSON file that defines all the panels
- In case you don't see any tweets, try changing to a different timeframe on timeline (e.g. by clicking 24 hours, 7 days etc). If there is a time mismatch between the VM and your machine, the tweets may appear at a different place on the timeline than expected.
-
Run a query in Solr to look at tweets/hashtags/alerts. Click on 'Query' and enter a query under 'q'. Examples are doctype_s:tweet and text_t:AAPL. You can choose an output format under 'wt' and click 'Execute Query'. http://sandbox.hortonworks.com:8983/solr/#/tweets
-
You can also search using Solr's APIs. The URL below displays all alerts in JSON format: http://sandbox.hortonworks.com:8983/solr/tweets/select?q=*%3A*&df=id&wt=json&fq=doctype_s:alert
- The SolrBolt code showing how the data gets into Solr is shown here (a small SolrJ query sketch follows)
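- For completeness, the same alert query can be issued from Java with SolrJ, using the same HttpSolrServer class the demo's bolts use (this sketch is not part of the demo code):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrQuerySketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://sandbox.hortonworks.com:8983/solr/tweets");
    SolrQuery query = new SolrQuery("doctype_s:alert"); // same filter as the URL above
    query.setRows(10);
    QueryResponse response = server.query(query);
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc);
    }
    server.shutdown();
  }
}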
-
Open a connection to HBase via Phoenix and notice that alerts were generated
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure
select * from alerts;
- Notice tweets written to sandbox filesystem via FileSystem bolt
vi /tmp/Tweets.xls
-
Open the Ranger Audits dashboard at http://sandbox.hortonworks.com:6083/banana
-
By default you will see a visualization of both HBase/Hive reads/gets:
-
Change the query filter to "action:write" to search for writes/puts:
-
Now disable the global allow policies on HBase and Hive and wait 30s:
-
Try running the same query in Hive view. It should fail as unauthorized
-
At this point, you should see some HBase audit records with result=0
- Confirm the same by opening the Audit tab of Ranger: http://sandbox.hortonworks.com:6080
- Re-enable the global allow policies.
-
Apache Zeppelin can also be installed on the cluster/sandbox to generate charts for analysis using:
- Spark
- SparkSQL
- Hive
- Flink
-
The Zeppelin Ambari service can be used to easily install/manage Zeppelin on HDP cluster
- Create ORC table and copy the tweets over:
hive -f /root/hdp22-twitter-demo/stormtwitter-mvn/createORC.sql
- View the contents of the ORC table created: http://sandbox.hortonworks.com:8000/beeswax/table/default/tweets_orc_partition_single
- Grant select access to user hive to the ORC table
hive -e 'grant SELECT on table tweets_orc_partition_single to user hive'
- On the Windows VM, create an ODBC connector called sandbox with the settings below:
Host=<IP address of sandbox VM>
port=10000
database=default
Hive Server type=Hive Server 2
Mechanism=User Name
UserName=hive
- Import data from the tweets_orc_partition_single table into Excel over ODBC: Data > From Other Data Sources > From Data Connection Wizard > ODBC DSN > sandbox > tweets_orc_partition_single > Finish > Yes > OK (a JDBC sketch using the same connection settings follows)
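- If you prefer to verify the same connection settings programmatically (host, port 10000, user hive), here is a small HiveServer2 JDBC sketch (not part of the demo scripts):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://sandbox.hortonworks.com:10000/default", "hive", "");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("select * from tweets_orc_partition_single limit 10");
    while (rs.next()) {
      System.out.println(rs.getString(1)); // print the first column of each row
    }
    conn.close();
  }
}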
- Instead of filtering on tweets for certain stocks/hashtags, you can also consume all tweets returned by the TwitterStream API and re-run runkafkaproducer.sh. Note that in this mode a large volume of tweets is generated, so you should stop the Kafka producer after 20-30s to avoid overloading the system. It may also take a few minutes after stopping the Kafka producer before all the tweets show up in Banana/Hive
mv /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv.bak
- To filter tweets based on geography, open the file below, uncomment this line, and re-run runkafkaproducer.sh
vi /root/hdp22-twitter-demo/kafkaproducer/TestProducer.java
/root/hdp22-twitter-demo/kafkaproducer/runkafkaproducer.sh
- This empties out the demo-related HDFS folders, Hive table, Solr core, and Banana webapp, and stops the Storm topology
/root/hdp22-twitter-demo/reset-demo.sh
- If Kafka keeps sending your topology old tweets, you can also clear the Kafka consumer offsets stored in ZooKeeper
zookeeper-client
rmr /group1
- To run on an actual cluster instead of the sandbox, a few things need to be changed before compiling/starting the demo:
-
When running the Kafka/HBase shells, change the ZooKeeper connect strings to use localhost instead of sandbox
-
/root/hdp22-twitter-demo/solrconfig.xml: change sandbox reference to HDFS location of solr user e.g. hdfs://summit-twitterdemo01.cloud.hortonworks.com:8020/user/solr
-
/root/hdp22-twitter-demo/default.json: change sandbox to Solr server (e.g. summit-twitterdemo01.cloud.hortonworks.com)
-
/root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/GNstorm.java: change the ZooKeeper host, NameNode URL, and Hive metastore URL
BrokerHosts hosts = new ZkHosts("localhost:2181");
String fsUrl = "hdfs://summit-twitterdemo01.cloud.hortonworks.com:8020";
String sourceMetastoreUrl = "thrift://summit-twitterdemo02.cloud.hortonworks.com:9083";
-
/root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/SolrBolt.java: change Solr collection url reference and zookeeper host reference
server = new HttpSolrServer("http://summit-twitterdemo01.cloud.hortonworks.com:8983/solr/tweets");
conn = phoenixDriver.connect("jdbc:phoenix:localhost:2181:/hbase-unsecure",new Properties());
-
/root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/TwitterRuleBolt.java: change Solr collection url reference and zookeeper host reference
conn = phoenixDriver.connect("jdbc:phoenix:localhost:2181:/hbase-unsecure",new Properties());
SolrServer server = new HttpSolrServer("http://summit-twitterdemo01.cloud.hortonworks.com:8983/solr/tweets");
-