/data-engineering-capstone-2020

Project for submission in the Udacity Data Engineering Nanodegree (please excuse the abundance of typos and low-quality Python code)


Capstone Project - Udacity Data Engineering

Evaluation criteria

  1. Project code is clean and modular:
    1. All coding scripts have an intuitive, easy-to-follow structure with code separated into logical functions.
    2. Naming for variables and functions follows the PEP8 style guidelines.
    3. The code should run without errors.
  2. Quality Checks:
    1. The project includes at least two data quality checks.
  3. Data Model:
    1. The ETL processes result in the data model outlined in the write-up.
    2. A data dictionary for the final data model is included.
    3. The data model is appropriate for the identified purpose.
  4. Datasets - project includes:
    1. At least two data sources.
    2. More than 1 million lines of data.
    3. At least two data sources/formats (CSV, API, JSON).
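
The "at least two data quality checks" criterion can be illustrated with a minimal sketch. It is not the code from the notebook: the function names, and the assumption that a table is available as a list of dictionaries, are hypothetical and chosen only to keep the example self-contained. In the project itself such checks would more likely run against Spark DataFrames.

```python
def check_not_empty(rows, table_name):
    """Check 1: the table must contain at least one row."""
    count = len(rows)
    if count == 0:
        raise ValueError(f"Quality check failed: {table_name} is empty")
    return count


def check_no_missing_keys(rows, key, table_name):
    """Check 2: every row must have a non-null value in the key column."""
    missing = sum(1 for row in rows if row.get(key) is None)
    if missing:
        raise ValueError(
            f"Quality check failed: {table_name} has {missing} row(s) without '{key}'"
        )
    return missing
```

Both functions raise an error on failure, so an ETL run stops at the first table that violates a check instead of silently producing an incomplete data model.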

Preparation for running

Some datasets are not included in this project due to their size. Before running this project, you need to download the datasets!

Ensure the following files exist at the right locations:

  • ./data/GlobalLandTemperaturesByCity.csv: from the World Temperature dataset
  • ./data/i94_apr16_sub.sas7bdat: I94 Immigration Data, from the US National Tourism and Trade Office
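
Before starting the container, the presence of both files can be verified with a short script. The following is a sketch only: the helper name `find_missing_files` is hypothetical and not part of the repository, and the paths are copied from the list above.

```python
from pathlib import Path

# Paths copied from the list above, relative to the repository root.
REQUIRED_FILES = (
    "data/GlobalLandTemperaturesByCity.csv",
    "data/i94_apr16_sub.sas7bdat",
)


def find_missing_files(project_root=".", required=REQUIRED_FILES):
    """Return the required paths that are not regular files under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).is_file()]
```

Calling `find_missing_files()` from the repository root returns an empty list when both datasets are in place; otherwise it lists the files that still need to be downloaded.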

Running

To start a Docker container with Jupyter Notebook and Spark, run:

./run_docker.sh

Next, open the Jupyter Notebook in the browser, replacing TOKEN with the correct access token:

http://127.0.0.1:8888/notebooks/work/Immigration.ipynb?token=TOKEN

Resources

It is recommended to assign at least 8 GB of memory and at least 4 CPU cores to Docker.