Spark Structured Streaming Example

Walkthrough Video

Watch the walkthrough video here

Overview

A mock data pipeline that reads a stream of weather data, lightly aggregates it, and saves it to a database.

Architecture

IoT devices --> Kafka --> Spark --> Cassandra  

NOTES: "IoT Devices" are represented with the Python script iot_devices.py. This script allows "weather data" to be captured and transmitted from three differnt weather sensors, which are located in Boston, Denver, and Los Angeles.

Quickstart

  1. Start a Kafka server
    • create a topic called weather (a programmatic Scala sketch appears after this list)
  2. Start a Cassandra database
    • create a keyspace called stuff (SimpleStrategy, replication=1)
       CREATE KEYSPACE stuff
       WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
      
    • create a table called weather with the following schema
       CREATE TABLE stuff.weather (
       	uuid uuid primary key,
       	device text,
       	temp double,
       	humd double,
       	pres double
       );
      
  3. Inside the StreamHandler directory, compile and package the Scala project:
    sbt package
    
  4. Then run
    spark-submit --class StreamHandler \
    --master local[*] \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,com.datastax.cassandra:cassandra-driver-core:4.0.0,com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 \
    target/scala-2.11/stream-handler_2.11-1.0.jar
    
  5. From the root directory, start one or more "IoT devices" (a Scala sketch of an equivalent producer follows this list):
    ./iot_devices.py boston
    ./iot_devices.py denver
    ./iot_devices.py losang
    
  6. Run SELECT * FROM stuff.weather; from cqlsh to verify that the data is being processed and saved correctly!
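
Step 1 above leaves topic creation to you; Kafka's command-line tooling works fine for this. As a programmatic alternative, here is a minimal Scala sketch using Kafka's AdminClient (the broker address and the single-partition, replication-factor-1 settings are assumptions for a local demo; it needs the kafka-clients library on the classpath):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

    object CreateWeatherTopicSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address

        val admin = AdminClient.create(props)
        try {
          // One partition with replication factor 1 is enough for a local demo
          admin.createTopics(Collections.singletonList(new NewTopic("weather", 1, 1.toShort)))
            .all()
            .get()
        } finally {
          admin.close()
        }
      }
    }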
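
The "IoT devices" in step 5 are the Python script shipped with the repo. Purely to illustrate the kind of message each device might publish, here is a rough Scala equivalent of a single device: the JSON field names mirror the weather table, while the broker address, the one-second interval, and the random readings are invented for this sketch.

    import java.util.Properties
    import scala.util.Random
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object IotDeviceSketch {
      def main(args: Array[String]): Unit = {
        val device = args.headOption.getOrElse("boston") // e.g. boston, denver, losang

        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        val rand = new Random()

        while (true) {
          // Made-up readings; only the field names are taken from the schema
          val temp = 15.0 + rand.nextGaussian() * 5
          val humd = 50.0 + rand.nextGaussian() * 10
          val pres = 1013.0 + rand.nextGaussian() * 3
          val payload = f"""{"device":"$device","temp":$temp%.2f,"humd":$humd%.2f,"pres":$pres%.2f}"""
          producer.send(new ProducerRecord[String, String]("weather", device, payload))
          Thread.sleep(1000) // one reading per second
        }
      }
    }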

For a complete walkthrough, check out this video