Note: The structure of the Spark Scala applications is taken from the following repository (thanks to the author, Tim van Cann from GDD).
This repository contains a simple "real-time" Spark streaming pipeline that produces the most popular source airports per time window.
The processing pipeline has two main steps (Scala sketches of the jobs follow this list):

- download: `PopularAirportsDownloader.scala` downloads the routes dataset from an HTTP URL to the local filesystem (TODO: switch this to MinIO). Submit it with `./submitTask0.sh` (see the steps below). Its output is the input dataset for the Spark jobs in the next step of the pipeline.
- process and store results: in this step, Spark jobs aggregate the data and output the top N airports that have the most destinations (routes) reachable from them as source airports. There are three different Spark jobs for this step:
  - `PopularAirports.scala`: a batch Spark job that reads the routes dataset, creates an overview of the top N airports used as source airport, and writes the output to the filesystem (submit with `./submitTask1.sh`).
  - `PopularAirportsStream.scala`: the same job implemented with the Spark Structured Streaming API, using the dataset files as a streaming source (submit with `./submitTask2.sh`).
  - `PopularAirportsWindowAgg.scala`: a Spark Structured Streaming job where the aggregation is done over sliding windows; any window length and sliding interval can be chosen. The end result is the top N source airports within each window, stored on the filesystem (submit with `./submitTask3.sh`).
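The download step amounts to fetching the routes dataset over HTTP and writing it to the local filesystem. Below is a minimal sketch of that idea; the URL and file name are placeholders, not the values used by `PopularAirportsDownloader.scala`.

```scala
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Sketch of the download step: fetch the routes dataset over HTTP and store it
// on the local filesystem. The URL and file name are placeholders.
object RoutesDownloaderSketch {
  def main(args: Array[String]): Unit = {
    val source = new URL("https://example.org/routes.dat") // placeholder URL
    val target = Paths.get("/usr/local/data/prod/input/task0-batch/routes.csv")
    Files.createDirectories(target.getParent)
    val in = source.openStream()
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
  }
}
```

The batch aggregation in `PopularAirports.scala` boils down to a group-by over the source airport column. The sketch below shows that shape, assuming a CSV input with a header and a column named `sourceAirport`; the real schema and column names may differ.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, desc}

object TopSourceAirportsSketch {

  // Core transformation: count routes per source airport and keep the top N.
  def topSourceAirports(routes: DataFrame, n: Int): DataFrame =
    routes
      .groupBy(col("sourceAirport"))          // assumed column name
      .agg(count("*").as("numRoutes"))
      .orderBy(desc("numRoutes"))
      .limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("popular-airports-sketch").getOrCreate()

    // Read the dataset produced by the downloader job (header is an assumption).
    val routes = spark.read
      .option("header", "true")
      .csv("/usr/local/data/prod/input/task0-batch/")

    topSourceAirports(routes, n = 10)
      .write.mode("overwrite")
      .csv("/usr/local/data/prod/output/task1-batch/")

    spark.stop()
  }
}
```

The windowed streaming variant follows the same idea but also groups by a sliding time window. A sketch under the assumptions that a processing-time timestamp is attached to each record (the routes data has no event-time column) and that results are printed to the console rather than stored on the filesystem:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, current_timestamp, window}
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object WindowedTopSourceAirportsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("popular-airports-window-sketch").getOrCreate()

    // Stream CSV files as they appear in the input directory
    // (schema and path are assumptions for this sketch).
    val routes = spark.readStream
      .option("header", "true")
      .schema("sourceAirport STRING, destinationAirport STRING")
      .csv("/usr/local/data/prod/input/task0-batch/")

    // Attach a processing-time timestamp and count routes per source airport
    // within a 1-minute window sliding every 30 seconds (any intervals work).
    // Deriving the top N per window is left to the real job.
    val counts = routes
      .withColumn("ts", current_timestamp())
      .groupBy(window(col("ts"), "1 minute", "30 seconds"), col("sourceAirport"))
      .agg(count("*").as("numRoutes"))

    val query = counts.writeStream
      .outputMode(OutputMode.Complete())
      .format("console")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
```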
All Spark jobs are covered by two types of tests:

- Integration tests: test reading from and writing to the filesystem (the I/O part of the apps); some tests also verify the structure of the saved data.
- Unit tests: test the logic of the transformations applied to the input data (see the sketch below this list).
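A unit test for the transformation logic can run against a local SparkSession with a small hand-made DataFrame. A minimal sketch, assuming ScalaTest and the hypothetical `topSourceAirports` helper from the batch sketch above (the real test suite may use a different framework and helpers):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class TopSourceAirportsSpec extends AnyFunSuite {

  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("unit-tests")
    .getOrCreate()

  import spark.implicits._

  test("returns the source airports with the most routes first") {
    // Small in-memory routes dataset: AMS has 3 routes, LHR has 2, JFK has 1.
    val routes = Seq(
      ("AMS", "LHR"), ("AMS", "JFK"), ("AMS", "CDG"),
      ("LHR", "JFK"), ("LHR", "CDG"),
      ("JFK", "CDG")
    ).toDF("sourceAirport", "destinationAirport")

    val result = TopSourceAirportsSketch.topSourceAirports(routes, n = 2)
      .collect()
      .map(row => (row.getString(0), row.getLong(1)))
      .toSeq

    assert(result === Seq(("AMS", 3L), ("LHR", 2L)))
  }
}
```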
To run the pipeline you need the Docker engine on your local machine. Clone the repository and build the Docker image, which bundles all dependencies.
The pipeline is executed manually by running a few commands in sequence: first build the Docker image, then submit the Spark jobs.
```bash
docker build -t flights-analytics -f Dockerfile .
# run the container in interactive mode
docker run --rm -it flights-analytics:latest
```
Inside the Docker container, execute the following bash commands to submit the Spark jobs:
```bash
# submit the PopularAirportsDownloader job
./opt/scripts/submitTask0.sh
# check the output data of that job at the following location
ls /usr/local/data/prod/input/task0-batch/

# submit the PopularAirports job
./opt/scripts/submitTask1.sh
# check the output data of that job at the following location
ls /usr/local/data/prod/output/task1-batch/

# submit the PopularAirportsStream job
./opt/scripts/submitTask2.sh
# check the output data of that job at the following location
ls /usr/local/data/prod/output/task2-stream/

# submit the PopularAirportsWindowAgg job
./opt/scripts/submitTask3.sh
# check the output data of that job at the following location
ls /usr/local/data/prod/output/task3-stream-windowing/
```
Note: This Spark job has to be terminated manually, because it keeps waiting for new CSV files to process after the 6th micro-batch (there are 6 input files, created by Task 1).
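If manual termination is undesirable, the streaming query handle returned by `writeStream.start()` can also be stopped programmatically. A small sketch using Spark's `StreamingQuery.awaitTermination(timeoutMs)` (the timeout value is only an example):

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// Wait for a bounded amount of time; if the query is still running after the
// timeout, stop it so the job does not have to be killed manually.
def stopAfter(query: StreamingQuery, timeoutMs: Long): Unit = {
  val terminated = query.awaitTermination(timeoutMs)
  if (!terminated) query.stop()
}
```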
- TODO: integrate this app with MinIO running in Docker.