Spark Structured Streaming Example

Walkthrough Video

Watch walkthrough video here


A mock data pipeline that reads a stream of weather data, lightly aggregates it, and saves the result to a Cassandra database.


IoT devices --> Kafka --> Spark --> Cassandra  

NOTE: The "IoT devices" are simulated by the Python script. It generates "weather data" and transmits it on behalf of three different weather sensors, located in Boston, Denver, and Los Angeles.
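
As a rough illustration of what such a simulated device might do, the sketch below builds one reading per call for a named city. The function name, the baseline values, the value ranges, and the comma-separated message layout are all assumptions made for this example and are not taken from the project's script. Publishing to the weather Kafka topic is left out so the snippet runs with the standard library alone.

```python
import random
import time

# Hypothetical baseline temperatures per device; the numbers are illustrative only.
BASELINES = {"boston": 50.0, "denver": 45.0, "losang": 70.0}


def make_reading(device, rng=random):
    """Return one simulated reading as a 'timestamp,device,temp,humd,pres' string."""
    if device not in BASELINES:
        raise ValueError(f"unknown device: {device}")
    temp = BASELINES[device] + rng.uniform(-5.0, 5.0)  # temperature around the baseline
    humd = rng.uniform(30.0, 90.0)                     # relative humidity, percent
    pres = rng.uniform(0.8, 1.1)                       # pressure, arbitrary units
    return f"{time.time():.0f},{device},{temp:.2f},{humd:.2f},{pres:.2f}"


if __name__ == "__main__":
    # Print one sample message per simulated sensor.
    for name in BASELINES:
        print(make_reading(name))
```

A real device script would send each string to the weather topic instead of printing it.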


  1. Start a Kafka server
    • create a topic called weather
  2. Start a Cassandra database
    • create a keyspace called stuff (SimpleStrategy, replication=1)
       CREATE KEYSPACE stuff
       WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    • create a table called weather with the following schema
       CREATE TABLE weather (
           uuid uuid primary key,
           device text,
           temp double,
           humd double,
           pres double
       );
  3. Inside the StreamHandler directory, package up the Scala file:
    sbt package
  4. Then run
    spark-submit --class StreamHandler \
    --master local[*] \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 \
    path/to/jar-built-by-sbt-package.jar
  5. From the root directory, start one or more "IoT devices":
    ./ boston
    ./ denver
    ./ losang
  6. Run select * from weather; in cqlsh to check that the data is being processed and saved correctly.
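
To make the "aggregates it slightly" step more concrete, here is a plain-Python sketch of a per-device average over one batch of readings. It produces rows shaped like the weather table (uuid, device, temp, humd, pres). It is a stand-in for illustration only, not the Spark job in StreamHandler. The input layout ('timestamp,device,temp,humd,pres'), the function name, and the choice of averaging are assumptions.

```python
import uuid
from collections import defaultdict


def aggregate_batch(lines):
    """Average temp, humd and pres per device for one batch of CSV readings.

    Each line is assumed to look like 'timestamp,device,temp,humd,pres'.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for line in lines:
        _, device, temp, humd, pres = line.strip().split(",")
        for i, value in enumerate((temp, humd, pres)):
            sums[device][i] += float(value)
        counts[device] += 1

    rows = []
    for device, totals in sums.items():
        n = counts[device]
        rows.append({
            "uuid": str(uuid.uuid4()),  # random id standing in for the primary key
            "device": device,
            "temp": totals[0] / n,
            "humd": totals[1] / n,
            "pres": totals[2] / n,
        })
    return rows


if __name__ == "__main__":
    batch = [
        "1590000000,boston,50.0,60.0,1.0",
        "1590000005,boston,54.0,70.0,1.2",
        "1590000005,denver,40.0,30.0,0.8",
    ]
    for row in aggregate_batch(batch):
        print(row)
```

Each returned dictionary corresponds to one row that would be written to the weather table.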

For a complete walkthrough, check out this video