This project is the final assignment in the Udacity Nanodegree in Data Engineering.
Four datasets are provided and outlined below. The notebook `Capstone_Project.ipynb` contains the overall data exploration and insights into the data pipeline.
The ETL pipeline is written as an Airflow Directed Acyclic Graph (DAG) which loads data from S3 into tables hosted on Amazon Redshift.
- Source: US National Tourism and Trade Office link.
- Format: `.sas7bdat`
- Info: Records of international visitor arrivals by regions and countries, visa type, method of travel, age, state, and port of entry.
- Source: Kaggle link.
- Format: `.csv`
- Info: Statistics on world temperature by date, city, country, latitude, and longitude.
- Source: OpenSoft link.
- Format: `.csv`
- Info: Demographics for US cities with a population >= 65,000.
- Source: Datahub link.
- Format: `.csv`
- Info: A table of IATA (3-letter) or ICAO (4-letter) airport codes and their corresponding cities.
The notebook `Capstone_Project.ipynb` contains the writeup for this project. It contains:
- Initial data exploration for this project.
- The method for saving the immigration dataset to `.parquet` format (see the sketch after this list).
- Inspection of the 4 datasets.
- Overview of the ETL pipeline.
- Some insights from the project.
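For reference, the conversion to `.parquet` can be done with pandas; the sketch below is only illustrative (the file names are placeholders and the notebook itself may use a different approach, e.g. Spark):

```python
# Minimal sketch: convert one month of the .sas7bdat immigration data to .parquet.
# File names are illustrative; requires pandas plus a parquet engine such as pyarrow.
import pandas as pd

df = pd.read_sas("i94_apr16_sub.sas7bdat", format="sas7bdat", encoding="ISO-8859-1")
df.to_parquet("immigration_apr16.parquet", index=False)
```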
- Create `.cfg` file for the project.
- Run the cells in the `Capstone_Project.ipynb` notebook to generate `.parquet` files.
- Copy data from Udacity workspace to S3
- Create the Redshift Cluster
- Database schema
This config is used to create the Redshift cluster, as well as to provide configuration for the `copy_data_to_s3.py` script. It should be placed in the project root directory. The `ARN` under `IAM_ROLE` can be completed after creating the cluster below.
```
[CLUSTER]
CLUSTER_TYPE=multi-node
NODE_TYPE=dc2.large
NUM_NODES=2
CLUSTER_IDENTIFIER=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=5439

[IAM_ROLE]
IAM_ROLE_NAME=
ARN=

[S3]
BUCKET=

[AWS]
KEY=
SECRET=
```
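The scripts can read this file with Python's `configparser`; a minimal sketch is shown below (the file name `dwh.cfg` is an assumption, substitute whatever name your `.cfg` file has):

```python
# Minimal sketch of reading the config above; the file name dwh.cfg is an assumption.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

cluster_id = config.get("CLUSTER", "CLUSTER_IDENTIFIER")
iam_role_arn = config.get("IAM_ROLE", "ARN")
bucket = config.get("S3", "BUCKET")
```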
The notebook contains an exploration of the data and generates the `.parquet` files for the immigration data.
I created a Python script to upload the unprocessed data from the local Udacity workspace to an S3 bucket on AWS. This includes the `.parquet` files created when the `Capstone_Project.ipynb` notebook is run.

To copy the data, run the following from the Udacity workspace:

```
python ./copy_data_to_s3.py
```
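The script itself is not reproduced here, but an upload along these lines can be done with `boto3`; a rough sketch (the config file name and the local data directory are assumptions):

```python
# Rough sketch of uploading local files to the S3 bucket named in the config.
# The config file name (dwh.cfg) and the local "data" directory are assumptions.
import configparser
import os

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

s3 = boto3.client(
    "s3",
    aws_access_key_id=config.get("AWS", "KEY"),
    aws_secret_access_key=config.get("AWS", "SECRET"),
)

bucket = config.get("S3", "BUCKET")
for root, _, files in os.walk("data"):
    for name in files:
        path = os.path.join(root, name)
        s3.upload_file(path, bucket, path)  # S3 key mirrors the local path
```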
Run the script to create a Redshift cluster (make sure that the `.cfg` file is complete):

```
python ./create_cluster.py
```
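The script is not shown here; the sketch below is only a rough outline of what cluster creation looks like with `boto3`, driven by the `[CLUSTER]` settings above (the region and config file name are assumptions, and this is not the project's actual script):

```python
# Rough sketch of creating the cluster with boto3 from the [CLUSTER] settings.
# Region and config file name are assumptions; not the project's actual script.
import configparser

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id=config.get("AWS", "KEY"),
    aws_secret_access_key=config.get("AWS", "SECRET"),
)

redshift.create_cluster(
    ClusterType=config.get("CLUSTER", "CLUSTER_TYPE"),
    NodeType=config.get("CLUSTER", "NODE_TYPE"),
    NumberOfNodes=int(config.get("CLUSTER", "NUM_NODES")),
    ClusterIdentifier=config.get("CLUSTER", "CLUSTER_IDENTIFIER"),
    DBName=config.get("CLUSTER", "DB_NAME"),
    MasterUsername=config.get("CLUSTER", "DB_USER"),
    MasterUserPassword=config.get("CLUSTER", "DB_PASSWORD"),
)
```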
Take note of the `IAM_ROLE`, `ARN`, and `ENDPOINT`, which are needed to complete the `.cfg` and to configure Airflow.
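Once the cluster is available, these values can be looked up with `boto3`; a sketch (the region and config file name are again assumptions):

```python
# Sketch: look up the cluster endpoint and IAM role ARN to note down.
# Region and config file name are assumptions.
import configparser

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

session = boto3.session.Session(
    region_name="us-west-2",
    aws_access_key_id=config.get("AWS", "KEY"),
    aws_secret_access_key=config.get("AWS", "SECRET"),
)

cluster = session.client("redshift").describe_clusters(
    ClusterIdentifier=config.get("CLUSTER", "CLUSTER_IDENTIFIER")
)["Clusters"][0]
role = session.client("iam").get_role(
    RoleName=config.get("IAM_ROLE", "IAM_ROLE_NAME")
)

print("ENDPOINT:", cluster["Endpoint"]["Address"])
print("ARN:     ", role["Role"]["Arn"])
```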
Several tables are created in this project. The schema is shown in Figure 1.
The database comprises:
- `public.staging_immigration`: immigration data
- `public.staging_temperature`: world temperature data
- `public.staging_airports`: airport code data
- `public.staging_demographics`: city demographic data
- `public.dim_i94cit`: city codes
- `public.dim_i94port`: port of entry codes
- `public.dim_i94mode`: mode of transport codes
- `public.dim_i94addr`: state codes
- `public.dim_i94visa`: visa type codes
Figure 1: The Schema modeled in this project.
The ETL Pipeline is written as an Airflow DAG and shown in Figure 2. The pipeline:
- Creates the dimension tables on Redshift.
- Creates the staging tables on Redshift.
- Copies the data from S3 into the staging tables.
- Copies data from the SAS Labels file into dimension tables.
- Performs Data Quality Checks on tables.
Figure 2: The Airflow DAG ETL Pipeline.
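The sketch below only illustrates how such a DAG can be wired up with stock Airflow 1.10-style operators; the project's actual DAG may use different operators, and the task names and SQL here are placeholders:

```python
# Simplified sketch of the DAG's task ordering; not the project's actual DAG.
# Airflow 1.10-style imports; task names and SQL are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    "capstone_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@once",
)

start = DummyOperator(task_id="begin_execution", dag=dag)

create_tables = PostgresOperator(
    task_id="create_tables",
    dag=dag,
    postgres_conn_id="redshift",
    sql="create_tables.sql",  # placeholder DDL for staging and dimension tables
)

end = DummyOperator(task_id="stop_execution", dag=dag)

# Staging, dimension-load and data-quality tasks would slot in between,
# wired up in the order shown in Figure 2.
start >> create_tables >> end
```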
Airflow needs to be configured with the Redshift connection and AWS credentials. Under Admin > Connections > Create New, add the following 2 connections. The entries should match those used to create the Redshift cluster in the `.cfg`.
```
Conn ID = aws_credentials
Conn Type = Amazon Web Services
Login = <AWS KEY>
Password = <AWS SECRET KEY>
```
```
Conn ID = redshift
Conn Type = Postgres
Schema = <DB_NAME>
Login = <DB_USER>
Password = <DB_PASSWORD>
Port = 5439
```
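Tasks in the DAG then pick up these connections by their `Conn ID`; roughly as follows (Airflow 1.10-style hooks, shown for illustration only):

```python
# Illustration of how tasks can resolve the two connections configured above.
# Airflow 1.10-style hook imports; not code from the project itself.
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

aws = AwsHook("aws_credentials")
credentials = aws.get_credentials()  # access key / secret from the connection

redshift = PostgresHook(postgres_conn_id="redshift")
redshift.run("SELECT 1;")  # executes against the cluster behind the 'redshift' conn
```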