SABD_project2_BallettiMarino

New York 🗽 school bus 🚌 data stream analysis. Uniroma2, SABD 2019/2020, Marco Balletti and Francesco Marino's second project.


SABD 2019/2020 second project

Authors: Marco Balletti, Francesco Marino

Project structure description

data

Folder containing the input dataset as a CSV file (dataset.csv).

docker-env

Folder containing scripts and files for a container-based execution of the project architecture:

  1. start-dockers.sh creates the Kafka cluster and the necessary Kafka topics,
  2. stop-dockers.sh deletes the created topics, then stops and removes the Kafka cluster, and
  3. docker-compose.yml is the Docker Compose file used to create the container infrastructure.

Documentation

Folder containing the benchmark results (under the Benchmark directory), the project report and the presentation slides.

Results

Folder containing Flink computation results as CSV files:

  1. query1_daily.csv containing the output of the first query evaluated by daily windows,
  2. query1_weekly.csv containing the output of the first query evaluated by weekly windows,
  3. query1_monthly.csv containing the output of the first query evaluated by monthly windows,
  4. query2_daily.csv containing the output of the second query evaluated by daily windows,
  5. query2_weekly.csv containing the output of the second query evaluated by weekly windows,
  6. query3_daily.csv containing the output of the third query evaluated by daily windows and
  7. query3_weekly.csv containing the output of the third query evaluated by weekly windows.

Results were computed over the entire dataset.

src

This directory contains, in its subdirectories, the Java code for:

  1. creation of a Kafka topic producer for the input data,
  2. creation of a Flink topology to run a DSP analysis of the three queries,
  3. creation of a Kafka Streams topology to run an alternative DSP analysis of the same three queries and
  4. creation of several Kafka topic consumers that save the DSP output.

Java Project structure description

It is recommended to open the entire directory with an IDE for better code navigation. The Java project was developed using JetBrains' IntelliJ IDEA.

The main folder contains the launchers for the processing architectures.

flink_dsp package

This package contains the classes that build and execute the query topologies using Flink as the DSP framework.

flink_dsp.query1 package

flink_dsp.query2 package

flink_dsp.query3 package

kafka_pubsub package

This package contains the configuration of the Kafka publish-subscribe service and the classes that instantiate consumers and producers.
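The concrete connection parameters live in the project's configuration classes. As a hedged sketch, a producer configuration of the kind such a class would supply might look like this (the broker address, acks value and serializer choice are illustrative assumptions, not the project's actual settings):

```java
import java.util.Properties;

public class ProducerConfigSketch {
    // Hypothetical broker address; the real one is defined by the project's config
    public static final String BOOTSTRAP_SERVERS = "localhost:9092";

    public static Properties producerProperties() {
        Properties props = new Properties();
        props.put("bootstrap.servers", BOOTSTRAP_SERVERS);
        // String key/value serializers: each CSV line can be published as-is
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "1"); // leader acknowledgement only (assumption)
        return props;
    }
}
```

A `KafkaProducer<String, String>` built from these properties would then replay the dataset rows onto the input topics.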

kafkastreams_dsp package

This package contains the classes that build and execute the query topologies using Kafka Streams as the DSP library, plus KafkaStreamsConfig.java, which provides the properties used to run the stream processing library.
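A minimal sketch of the kind of properties such a configuration class returns (the application id, broker address and default serdes here are assumptions, not the values in KafkaStreamsConfig.java):

```java
import java.util.Properties;

public class StreamsConfigSketch {
    public static Properties streamsProperties() {
        Properties props = new Properties();
        // application.id doubles as the consumer group and state store prefix
        props.put("application.id", "sabd2-queries");       // illustrative value
        props.put("bootstrap.servers", "localhost:9092");   // illustrative value
        props.put("default.key.serde",
                "org.apache.kafka.common.serialization.Serdes$StringSerde");
        props.put("default.value.serde",
                "org.apache.kafka.common.serialization.Serdes$StringSerde");
        return props;
    }
}
```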

kafkastreams_dsp.queries package

This package contains the classes that create the query topologies.

kafkastreams_dsp.windows package

This package contains custom Kafka Streams windows:

  • CustomTimeWindows.java, an abstract class representing a generic custom-duration time window,
  • DailyTimeWindows.java, implementing a daily time window aligned to a given time zone,
  • MonthlyTimeWindows.java, implementing a monthly time window aligned to the first day of the month in a given time zone, and
  • WeeklyTimeWindows.java, implementing a weekly time window starting on Monday and ending on Sunday, aligned to a given time zone.
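The core of such a custom window is the alignment computation: mapping an event timestamp to the start of its enclosing window in the target time zone. A sketch of the Monday alignment a weekly window needs (class and method names here are illustrative, not the project's API):

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneId;

public class WeeklyAlignmentSketch {
    // Returns the start (midnight, Monday) of the week containing the event
    // time, evaluated in the given time zone.
    public static Instant weekStart(Instant eventTime, ZoneId zone) {
        return eventTime.atZone(zone)
                .with(java.time.temporal.TemporalAdjusters
                        .previousOrSame(DayOfWeek.MONDAY))
                .toLocalDate()
                .atStartOfDay(zone)
                .toInstant();
    }
}
```

The same idea with `TemporalAdjusters.firstDayOfMonth()` yields the monthly alignment, which is why calendar-aware windows cannot be expressed with Kafka Streams' fixed-size `TimeWindows`.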

utility package

This package contains classes that support query execution, in particular:

utility.accumulators package

This package contains the classes used as accumulators by both the Flink and the Kafka Streams processing.
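Accumulators of this kind typically hold partial aggregates that can be merged across parallel instances, as Flink's AggregateFunction requires. A hedged sketch of the general shape, here for a running average (field and method names are assumptions, not the project's classes):

```java
public class AverageAccumulatorSketch {
    private double sum = 0.0;
    private long count = 0;

    // Folds one value into the partial aggregate
    public void add(double value) {
        sum += value;
        count++;
    }

    // Combines two partial aggregates; Flink's AggregateFunction needs this
    // step, and the same shape is reusable in Kafka Streams aggregations
    public AverageAccumulatorSketch merge(AverageAccumulatorSketch other) {
        AverageAccumulatorSketch out = new AverageAccumulatorSketch();
        out.sum = this.sum + other.sum;
        out.count = this.count + other.count;
        return out;
    }

    public double result() {
        return count == 0 ? 0.0 : sum / count;
    }
}
```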

utility.benchmarks package

This package contains utilities for latency and throughput evaluation.
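The two metrics themselves are simple; a sketch of the arithmetic such utilities compute (the project's actual measurement points and units may differ):

```java
public class BenchmarkSketch {
    // Throughput: tuples processed per second over an observation interval
    public static double throughput(long tupleCount, long elapsedMillis) {
        return tupleCount / (elapsedMillis / 1000.0);
    }

    // Latency: mean difference between output and input timestamps, in millis
    public static double meanLatency(long[] inTimestamps, long[] outTimestamps) {
        double total = 0;
        for (int i = 0; i < inTimestamps.length; i++) {
            total += outTimestamps[i] - inTimestamps[i];
        }
        return total / inTimestamps.length;
    }
}
```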

utility.delay package

This package contains utilities for parsing delay strings and ranking delay types.
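In the source dataset the delay field is free text, so parsing has to normalize heterogeneous values into a single unit. A sketch under that assumption (the example formats and the project's actual parsing rules may differ):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelayParserSketch {
    // Matches a number optionally followed by an hour marker, e.g. "20 mins",
    // "1 hour", "2 hrs"; formats are assumptions about the raw field
    private static final Pattern DELAY =
            Pattern.compile("(\\d+)\\s*(h|hr|hour)?", Pattern.CASE_INSENSITIVE);

    // Returns the delay in minutes, or -1 if no number could be extracted
    public static int parseMinutes(String raw) {
        if (raw == null) return -1;
        Matcher m = DELAY.matcher(raw);
        if (!m.find()) return -1;
        int value = Integer.parseInt(m.group(1));
        boolean hours = m.group(2) != null;
        return hours ? value * 60 : value;
    }
}
```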

utility.serdes package

This package contains data serialization and deserialization utilities.
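Kafka moves byte arrays, so every such utility ultimately reduces to a byte-level round trip. A minimal sketch of that round trip for a CSV-style record (the field layout and separator are assumptions; the project's serdes wrap logic like this in Kafka's Serializer/Deserializer interfaces):

```java
import java.nio.charset.StandardCharsets;

public class SerdesSketch {
    // Encodes a record as a UTF-8, semicolon-separated line
    public static byte[] serialize(String busId, int delayMinutes) {
        return (busId + ";" + delayMinutes).getBytes(StandardCharsets.UTF_8);
    }

    // Decodes the byte array back into its fields
    public static String[] deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).split(";");
    }
}
```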