For information on the use case and technical architecture, see the PDF in the slides directory.
- CDH 5.10, including the following services:
  - Impala
  - Kafka
  - Kudu
  - Spark 2.0
  - Solr (optional)
  - Hue
- StreamSets Data Collector 2.2.1.0
- Anaconda Parcel for Cloudera
- Run the setup script to install the required Python libraries:

      bash setup/install.sh
- Edit the StreamSets configuration in Cloudera Manager:
  - Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh:

        export STREAMSETS_LIBRARIES_EXTRA_DIR="/opt/sdclib/"

  - Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-security.policy:

        grant codebase "file:///opt/sdclib/-" { permission java.security.AllPermission; };

- Restart StreamSets
- Edit config.ini with the desired data generator settings (number of wells, sensors, amount of history, etc.) and Hadoop settings (Kafka and Kudu servers)
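As an illustrative sketch only (the section and key names below are assumptions, not the repo's actual config.ini schema), settings of this kind can be read with Python's configparser:

```python
# Sketch: read hypothetical generator and Hadoop settings from config.ini.
# Section and key names below are illustrative assumptions, not the repo's schema.
import configparser

SAMPLE = """
[generator]
wells = 10
sensors_per_well = 5
history_days = 30

[hadoop]
kafka_brokers = broker1:9092,broker2:9092
kudu_masters = kudu-master:7051
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # in the repo this would be config.read("config.ini")

wells = config.getint("generator", "wells")
brokers = config.get("hadoop", "kafka_brokers").split(",")
print(wells, brokers)
```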
- Create tables to store sensor data in Kudu and generate static lookup information:

      python datagen/historian.py config.ini static
- Generate historic data and store in Kudu:

      python datagen/historian.py config.ini historic
- Open the Kudu web UI, navigate to the newly created tables, extract the Impala DDL statements, and run them in Hue
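The DDL generated by the Kudu web UI looks roughly like the following sketch (the table names here are hypothetical; use the statements the UI produces for your actual tables):

```sql
-- Illustrative only: an external Impala table mapped onto an existing Kudu table.
-- Replace both table names with the ones shown in the Kudu web UI.
CREATE EXTERNAL TABLE measurements
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'measurements');
```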
- Open StreamSets and import the pipeline by uploading the pipeline configuration file streamsets/historian_ingest.json
- Open the StreamSets pipeline and edit its constants to point at the Kafka broker servers and the Kudu master server
- Start the StreamSets pipeline
- Start sensor data generator:

      python datagen/historian.py config.ini realtime
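The realtime feed produced by datagen/historian.py can be pictured as a stream of timestamped measurements; the sketch below is an assumption about its shape (the field names and JSON-over-Kafka framing are not taken from the repo):

```python
# Sketch: generate random sensor readings of the kind a realtime feed might push to Kafka.
# Field names and value ranges are illustrative assumptions, not the repo's actual schema.
import json
import random
import time

def make_reading(well_id, sensor_id):
    return {
        "well_id": well_id,
        "sensor_id": sensor_id,
        "ts": int(time.time() * 1000),   # epoch millis
        "value": round(random.uniform(0.0, 100.0), 2),
    }

readings = [make_reading(w, s) for w in range(2) for s in range(3)]
payloads = [json.dumps(r) for r in readings]
# A Kafka producer (e.g. kafka-python's KafkaProducer) would send each payload
# to the topic that the StreamSets pipeline consumes; omitted here.
print(len(payloads))
```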
- Solr (not included in the git repo):
  - Create a collection for the measurements data
  - Create a Hue dashboard based on the measurements data
  - Add a destination in the StreamSets pipeline to send measurement data to the Solr collection
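Documents reach Solr through its JSON update endpoint (`/solr/<collection>/update?commit=true`); the sketch below shapes a measurement record into a Solr document (the field names, dynamic-field suffixes, and collection name are assumptions):

```python
# Sketch: shape measurement records into documents for Solr's JSON update API.
# Field names, dynamic-field suffixes, and the collection name are illustrative assumptions.
import json

def to_solr_doc(reading):
    # Dynamic-field-friendly names (*_i, *_l, *_d); the real collection's schema may differ.
    return {
        "id": f"{reading['well_id']}-{reading['sensor_id']}-{reading['ts']}",
        "well_id_i": reading["well_id"],
        "sensor_id_i": reading["sensor_id"],
        "ts_l": reading["ts"],
        "value_d": reading["value"],
    }

docs = [to_solr_doc({"well_id": 1, "sensor_id": 2, "ts": 1500000000000, "value": 42.5})]
body = json.dumps(docs)
# To index: POST `body` with Content-Type application/json to
# http://<solr-host>:8983/solr/measurements/update?commit=true
print(body)
```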
- Impala
  - Run the queries in the impala directory