/aws-test

Use-Case: Airline on-time performance

Primary Language: Java

TODO: Case study Blog and AWS services use

Reference Link: http://stat-computing.org/dataexpo/2009/

Have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you'd had more data? This is your chance to find out.

The data

The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and it takes up 1.6 gigabytes of space when compressed and 12 gigabytes when uncompressed.

Batch Ingestion & Processing

Create the following Hadoop directory structure:

| Layer | Directory Path | File Format |
| --- | --- | --- |
| Raw | /data/raw/ | As is (e.g. TXT, CSV, XML, JSON) |
| Decomposed | /data/decomposed/ | Avro |
| Modelled | /data/modelled/ | Parquet |
| Schema (metadata) | /data/schema/ | AVSC schema |
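
The layout above could be captured once in code so that every job resolves paths the same way. The following is a minimal sketch, not part of the original project: the `Layer` enum, the `pathFor` helper, and the idea of one lowercase subfolder per dataset are all assumptions made for illustration.

```java
import java.util.Locale;

/** Sketch of the four storage layers; names and helper methods are hypothetical. */
enum Layer {
    RAW("/data/raw/", "as-is"),
    DECOMPOSED("/data/decomposed/", "avro"),
    MODELLED("/data/modelled/", "parquet"),
    SCHEMA("/data/schema/", "avsc");

    private final String basePath;
    private final String format;

    Layer(String basePath, String format) {
        this.basePath = basePath;
        this.format = format;
    }

    String basePath() {
        return basePath;
    }

    String format() {
        return format;
    }

    /** Directory for one dataset inside this layer (per-dataset subfolder is an assumption). */
    String pathFor(String dataset) {
        return basePath + dataset.toLowerCase(Locale.ROOT) + "/";
    }
}

class LayerLayoutDemo {
    public static void main(String[] args) {
        // Print the resolved directory of one dataset in every layer.
        for (Layer layer : Layer.values()) {
            System.out.println(layer + " -> " + layer.pathFor("Airports") + " (" + layer.format() + ")");
        }
    }
}
```

Keeping the base paths in one place means a later change to the directory structure touches a single file rather than every ingestion job.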

Source data details

Download the data for the years 2007 and 2008:
http://stat-computing.org/dataexpo/2009/the-data.html

Supplemental Data:
http://stat-computing.org/dataexpo/2009/supplemental-data.html

Data preparation

Create a Kafka cluster, then create the following topics in Kafka:

  • Airports
  • Carriers
  • Planedate
  • OTP

Download the data for 2007 and 2008 and load it into the Kafka cluster under the relevant topics. You may use any method of your choice to load the data into the topics.
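
Whichever loading method is chosen, each downloaded file has to be mapped to one of the four topics. The sketch below shows one possible routing rule in plain Java. The class name `TopicRouter` and the file names it matches (`airports.csv`, `carriers.csv`, `plane-data.csv`, `2007.csv`, `2008.csv`, optionally with a `.bz2` suffix) are assumptions; adjust them to the files actually downloaded. Sending the lines to Kafka is left out, since it depends on the client library in use.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper that decides which Kafka topic a source file belongs to. */
class TopicRouter {

    /** Returns the topic for a file name, or empty if the file is not part of the load. */
    static Optional<String> topicFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.startsWith("airports")) {
            return Optional.of("Airports");
        }
        if (name.startsWith("carriers")) {
            return Optional.of("Carriers");
        }
        if (name.startsWith("plane-data")) {
            return Optional.of("Planedate");
        }
        // Yearly on-time performance files, with or without bzip2 compression.
        if (name.matches("(2007|2008)\\.csv(\\.bz2)?")) {
            return Optional.of("OTP");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] files = {"airports.csv", "carriers.csv", "plane-data.csv", "2008.csv.bz2", "notes.txt"};
        for (String file : files) {
            System.out.println(file + " -> " + topicFor(file).orElse("(skipped)"));
        }
    }
}
```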

### Batch Ingestion (HDFS)

#### Raw layer (store data as-is)

  • Consume messages from the Airports and Planedate Kafka topics into the HDFS raw folder.
  • Use Spark Streaming to consume messages from the Carriers and OTP Kafka topics into the HDFS raw folder.

#### Decomposed layer (append UUID and timestamp to the as-is data)

  • For each message in the Airports and Planedate data from the raw directory, append a UUID and a timestamp.
  • For each message in the Carriers and OTP data from the raw directory, append a UUID and a timestamp.
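
The decomposed-layer step can be illustrated on a single record. This is a minimal sketch under stated assumptions: records are treated as comma-delimited text, the UUID and timestamp are appended as two trailing fields, and the timestamp is an ISO-8601 UTC instant. The class name `Decomposer` is hypothetical, the sample record is made up, and serialising the result to Avro is not shown.

```java
import java.time.Instant;
import java.util.UUID;

/** Hypothetical helper for the decomposed layer: raw record + UUID + timestamp. */
class Decomposer {

    /** Appends the given UUID and timestamp as two extra comma-separated fields. */
    static String decompose(String rawRecord, UUID id, Instant ingestedAt) {
        return rawRecord + "," + id + "," + ingestedAt;
    }

    /** Convenience overload that generates a random UUID and uses the current time. */
    static String decompose(String rawRecord) {
        return decompose(rawRecord, UUID.randomUUID(), Instant.now());
    }

    public static void main(String[] args) {
        // Made-up sample record, used only to show the shape of the output.
        System.out.println(decompose("XX,Example Air"));
    }
}
```

Passing the UUID and timestamp in explicitly keeps the transformation deterministic, which makes it easy to test; the one-argument overload is what a job would normally call.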

#### Modelling and processing

Cleanse the data (trim whitespace, handle nulls, remove duplicates) and load it into the modelled layer in Parquet format using Spark/Scala.
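
The three cleansing rules can be shown on in-memory rows before they are expressed as a Spark job. The sketch below is plain Java, not the Spark/Scala implementation the task asks for, and it makes two assumptions: empty fields and the literal `NA` both count as missing, and "duplicate" means a row whose fields are all equal after trimming. The class name `Cleanser` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of trim / null handling / de-duplication. */
class Cleanser {

    /** Trims every field and maps empty or "NA" values to null. */
    static List<String> cleanseRow(String[] fields) {
        List<String> out = new ArrayList<>(fields.length);
        for (String field : fields) {
            String value = field == null ? "" : field.trim();
            out.add(value.isEmpty() || value.equals("NA") ? null : value);
        }
        return out;
    }

    /** Cleanses all rows and drops exact duplicates, keeping first-seen order. */
    static List<List<String>> cleanse(List<String[]> rows) {
        Set<List<String>> unique = new LinkedHashSet<>();
        for (String[] row : rows) {
            unique.add(cleanseRow(row));
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {" AA ", "NA"});
        rows.add(new String[] {"AA", ""});   // duplicate of the first row after cleansing
        rows.add(new String[] {"UA", "5"});
        System.out.println(cleanse(rows));
    }
}
```

Cleansing each row before de-duplicating matters: rows that differ only in whitespace or in how a missing value is written collapse into one.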

#### Develop a solution to answer the following questions

  • Which carrier performs better?
  • When is the best time of day/day of week/time of year to fly to minimise delays?
  • Do older planes suffer more delays?
  • Can you detect cascading failures as delays in one airport create delays in others?
  • Are there critical links in the system?
  • How well does weather predict plane delays?
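
As a starting point for the first question, one simple measure of carrier performance is the mean arrival delay per carrier, where a lower value is better. The sketch below is one possible reading of "performs better", not a prescribed method. It assumes comma-separated rows, that the carrier code and arrival delay sit at 0-based columns 8 and 14 of the yearly files (`UniqueCarrier` and `ArrDelay`), and that non-numeric delays such as `NA` should be skipped. Verify the column positions against the downloaded files. The class name `CarrierDelay` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: mean arrival delay per carrier from CSV lines. */
class CarrierDelay {
    // Assumed 0-based positions of UniqueCarrier and ArrDelay in the yearly CSV files.
    static final int UNIQUE_CARRIER = 8;
    static final int ARR_DELAY = 14;

    /** Mean arrival delay in minutes per carrier; rows without a numeric delay are skipped. */
    static Map<String, Double> meanArrivalDelay(List<String> csvLines, int carrierIdx, int delayIdx) {
        Map<String, long[]> totals = new TreeMap<>(); // carrier -> {sum of delays, row count}
        for (String line : csvLines) {
            String[] fields = line.split(",", -1);
            if (fields.length <= Math.max(carrierIdx, delayIdx)) {
                continue; // malformed or short row
            }
            long delay;
            try {
                delay = Long.parseLong(fields[delayIdx].trim());
            } catch (NumberFormatException e) {
                continue; // header row, "NA", or empty value
            }
            long[] t = totals.computeIfAbsent(fields[carrierIdx].trim(), k -> new long[2]);
            t[0] += delay;
            t[1]++;
        }
        Map<String, Double> means = new TreeMap<>();
        totals.forEach((carrier, t) -> means.put(carrier, (double) t[0] / t[1]));
        return means;
    }

    public static void main(String[] args) {
        // Tiny two-column sample; real files would use UNIQUE_CARRIER and ARR_DELAY.
        List<String> sample = Arrays.asList("Carrier,ArrDelay", "AA,10", "AA,20", "UA,NA", "UA,-4");
        System.out.println(meanArrivalDelay(sample, 0, 1));
    }
}
```

The same grouping pattern, keyed by hour of day, day of week, or month instead of carrier, gives a first answer to the timing question as well.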