The IBM Cloud Streaming Retail Demo showcases several data and analytics technologies on the IBM Cloud, including:
- IBM Message Hub (Kafka)
- IBM Analytics Engine (Spark Structured Streaming)
- IBM Cloud Foundry
- IBM Compose ScyllaDB (Cassandra)
- IBM Compose Elasticsearch
- IBM Cloud Object Storage
- Machine Learning (Spark ML, scikit-learn)
All of the demo code lives in this repository's parent GitHub organisation, ibm-cloud-streaming-retail-demo. The organisation contains several repositories, each focused on a different aspect of the solution:
- dataset-generator: generates the main retail dataset for the demo. Start with this project, because the other projects need the dataset it produces.
- kafka-producer-for-simulated-data: sends the dataset generated by the dataset-generator project to IBM Message Hub (Kafka).
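The hand-off between these two projects can be pictured with a minimal sketch. It is not the code from either repository: the column names, the sample rows, and the `rows_to_messages` and `publish` helpers are all hypothetical, and the `send` callable stands in for whatever Kafka client the real producer uses. It only shows one plausible way to turn CSV rows into keyed JSON messages.

```python
import csv
import io
import json


def rows_to_messages(csv_text, key_field="InvoiceNo"):
    """Yield (key, value) byte pairs, one per CSV row.

    Each value is the row encoded as a JSON object. The key is taken from
    key_field (an assumed column name) so that, with key-based partitioning,
    rows sharing a key would go to the same partition.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        key = row[key_field].encode("utf-8")
        value = json.dumps(row, sort_keys=True).encode("utf-8")
        yield key, value


def publish(messages, send):
    """Pass each message to send(key, value) and return the number sent."""
    count = 0
    for key, value in messages:
        send(key, value)
        count += 1
    return count


# Hypothetical sample rows; the generated dataset's columns may differ.
SAMPLE_CSV = (
    "InvoiceNo,StockCode,Quantity,UnitPrice\n"
    "536365,85123A,6,2.55\n"
    "536366,22633,6,1.85\n"
)

if __name__ == "__main__":
    # Stand-in for a Kafka client's send call: print instead of publishing.
    sent = publish(rows_to_messages(SAMPLE_CSV), lambda k, v: print(k, v))
    print("sent", sent)
```

Keeping the serialisation separate from the `send` callable means the same generator could feed a real Kafka client or a test double without changes.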
The following two repositories are a work in progress (only the documentation needs updating):
- spark-structured-streaming-on-iae-to-cos: saves the Kafka data stream to IBM Cloud Object Storage (COS) using Apache Spark on IBM Analytics Engine.
- spark-structured-streaming-on-iae-to-elasticsearch: saves the Kafka data stream to IBM Compose Elasticsearch using Apache Spark on IBM Analytics Engine.
The following repository is a work in progress (it works on standalone Spark, but not on IAE):
- spark-structured-streaming-on-iae-to-scylladb: saves the Kafka data stream to IBM Compose ScyllaDB using Apache Spark on IBM Analytics Engine.
More coming soon:
- IBM Cloud SQL Query: periodically convert the JSON that spark-structured-streaming-on-iae-to-cos writes to the landing zone into partitioned Parquet/ORC to support Hive queries
- Looker report on Hive data, either populated by IBM Cloud SQL Query or read directly from the landing zone
- Cognos report on Hive data, either populated by IBM Cloud SQL Query or read directly from the landing zone
- spark-structured-streaming-on-iae-to-hbase (see https://stackoverflow.com/a/49450254/1033422)
- spark-structured-streaming-on-iae-to-phoenix, possibly using a JDBC sink (see https://stackoverflow.com/q/45373795/1033422)
- Real-time reporting dashboard using data in Hive
- Compose PostgreSQL sink (see https://stackoverflow.com/q/45373795/1033422)
- Spark Structured Streaming with Hive streaming (see https://github.com/jerryshao/spark-hive-streaming-sink)
This project is based on the following dataset:
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
More information on the dataset can be found in the dataset-generator project.