Capstone Project - Udacity Data Engineering
Evaluation criteria
- Project code is clean and modular:
  - All coding scripts have an intuitive, easy-to-follow structure with code separated into logical functions.
  - Naming for variables and functions follows the PEP8 style guidelines.
  - The code should run without errors.
- Quality Checks:
  - The project includes at least two data quality checks.
- Data Model:
  - The ETL processes result in the data model outlined in the write-up.
  - A data dictionary for the final data model is included.
  - The data model is appropriate for the identified purpose.
- Datasets - project includes:
  - At least 2 data sources
  - More than 1 million lines of data.
  - At least two data sources/formats (csv, api, json)
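To illustrate the "at least two data quality checks" criterion above, here is a minimal sketch of two common checks: a non-empty table check and a null check on a key column. It is written in plain Python over a list of dict rows so it runs anywhere. The function names, the table name, and the column names are hypothetical, and the checks actually used in the notebook may differ.

```python
def check_not_empty(rows, table_name):
    """Raise if the table has no rows; otherwise return the row count."""
    count = len(rows)
    if count == 0:
        raise ValueError(f"Data quality check failed: {table_name} is empty")
    return count


def check_no_nulls(rows, column, table_name):
    """Raise if any row has a missing value in `column`; otherwise return 0."""
    null_count = sum(1 for row in rows if row.get(column) is None)
    if null_count > 0:
        raise ValueError(
            f"Data quality check failed: {table_name}.{column} "
            f"has {null_count} null value(s)"
        )
    return null_count


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"city": "Berlin", "avg_temp": 9.5},
    {"city": "Oslo", "avg_temp": None},
]
```

With the sample rows, `check_not_empty(sample_rows, "temperature")` passes, and `check_no_nulls(sample_rows, "avg_temp", "temperature")` raises because one row has no temperature value.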
Preparation for running
Some datasets are not included in this project due to their size. Before running this project, you need to download the datasets!
Ensure the following files exist at the right locations:
- ./data/GlobalLandTemperaturesByCity.csv: from the World Temperature dataset (link)
- ./data/i94_apr16_sub.sas7bdat: I94 Immigration Data, from the US National Tourism and Trade Office (link)
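Before starting the container, it can help to confirm that the downloads landed in the right place. The following is a minimal sketch that checks the two paths listed above relative to the project root. The helper name `find_missing_files` and the constant `REQUIRED_FILES` are hypothetical and not part of the project.

```python
from pathlib import Path

# Dataset paths from the list above, relative to the project root.
REQUIRED_FILES = [
    "data/GlobalLandTemperaturesByCity.csv",
    "data/i94_apr16_sub.sas7bdat",
]


def find_missing_files(project_root=".", required=REQUIRED_FILES):
    """Return the required dataset paths that are not files under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).is_file()]
```

Calling `find_missing_files()` from the project root returns an empty list when both files are in place, and otherwise lists the paths that still need to be downloaded.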
Running
To start a Docker container with Jupyter Notebook and Spark, run:
./run_docker.sh
Next, open the Jupyter Notebook in the browser, replacing TOKEN with the correct access token:
http://127.0.0.1:8888/notebooks/work/Immigration.ipynb?token=TOKEN
Resources
It is recommended to assign at least 8 GB of memory and at least 4 CPU cores to Docker.