This is the capstone project for the Udacity Data Engineering Nanodegree.
The goal of the project is to run aggregate analyses on US demographic information from 2015 and on US visitors in 2016. The data is stored in a data lake in AWS S3 and the ETL solution is implemented in Apache Spark. On the final (processed) dataset one can calculate summary statistics such as total population per race per state, number of visitors entering each state per country of residence or per country of citizenship, number of visitors per visa type in each state, visa duration, etc.
The US Demographics Data comes from the US Census Bureau's 2015 American Community Survey and was downloaded from here. This dataset contains information about the demographics of all US cities and census designated places with a population greater than or equal to 65,000. It records information such as age, gender, household, race, and population counts. The dataset is in `csv` format and contains 2,891 rows and 12 columns. For a full description of the data see DEMOGRAPHICS_README.txt
The I94 Immigration Data is from 2016 and was downloaded from here. This dataset contains information about non-US residents visiting the US during 2016. It records information such as date of entry into the US, country of citizenship, country of residence, destination (state), and visa type. The original dataset is in `sas` format, one file per month. The full dataset contains 40,790,529 rows and 28 columns. For a full description of the data see I94_SAS_README.txt.
Exploratory Data Analysis steps are contained in the `eda.ipynb` notebook.
This dataset is clean, meaning that there are very few null entries (none of them in columns of interest) and no duplicate rows. The main processing step is to rearrange the data, i.e. pivot the `Race` column, to obtain a dataset with `City` and `State` entries as rows and demographic information as columns.
The immigration dataset required a lot more processing. The main steps are outlined below, followed by a short sketch of the date handling:
- Read data from `sas` format and cast columns to the correct types
- Write data to `parquet` files partitioned by `i94mon` (month) to be easily accessible
- Replace SAS codes with the corresponding information
- Convert the SAS time format to date type
- Convert strings to date type and handle corner cases, e.g. year 9999
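For illustration, the date handling could look like the following PySpark sketch. The column names come from the data model below; `df` and the exact corner-case handling are placeholders rather than the project code itself.

```python
from pyspark.sql import functions as F

# SAS numeric dates (arrdate, depdate) count the days elapsed since 1960-01-01.
df = df.withColumn(
    "arrdate", F.expr("date_add(to_date('1960-01-01'), cast(arrdate as int))")
)

# dtaddto is an MMddyyyy string; open-ended admissions (year 9999) are left as null.
df = df.withColumn(
    "dtaddto",
    F.when(~F.col("dtaddto").endswith("9999"), F.to_date("dtaddto", "MMddyyyy")),
)
```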
Map out the conceptual data model and explain why you chose that model.
Column | Type | Description |
---|---|---|
`state` | StringType | Full state name |
`state_code` | StringType | Two-character state code |
`num_cities` | LongType | Number of cities considered in the state |
`total_pop` | DoubleType | Total population |
`amind_pop` | LongType | Total American Indian and Alaska Native population |
`asian_pop` | LongType | Total Asian population |
`afram_pop` | LongType | Total Black or African-American population |
`hispl_pop` | LongType | Total Hispanic or Latino population |
`white_pop` | LongType | Total White population |
This data model differs from the raw data in that the `Race` column was pivoted, so that for each state we get one column per `Race` value with its corresponding population count. The reason for choosing this model is that we are interested in state-level aggregations. A finer aggregation, per city, is also possible, but it would require more work to match the demographics and immigration city entries.
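A minimal PySpark sketch of this pivot, assuming the raw demographics columns are named `State`, `State Code`, `Race` and `Count` as in the source CSV (the pivoted columns are then renamed to the `*_pop` names above):

```python
from pyspark.sql import functions as F

# One row per state, one population column per distinct Race value.
state_demographics = (
    demographics
    .groupBy("State", "State Code")
    .pivot("Race")
    .agg(F.sum("Count"))
)
```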
Column | Type | Description |
---|---|---|
`cicid` | IntegerType | CIC id number |
`i94yr` | IntegerType | Year |
`i94mon` | IntegerType | Month |
`i94cit` | StringType | Country of citizenship |
`i94res` | StringType | Country of residence |
`i94port` | StringType | Port of entry |
`arrdate` | DateType | Arrival date in the US |
`i94addr` | StringType | Address in the US |
`depdate` | DateType | Departure date from the US |
`i94bir` | IntegerType | Age of respondent in years |
`i94visa` | StringType | Visa category |
`count` | IntegerType | Always 1 (used for summary statistics) |
`dtadfile` | DateType | Date added to I-94 files (raw format yyyyMMdd) |
`biryear` | IntegerType | 4-digit year of birth |
`dtaddto` | DateType | Date to which admitted to the US (raw format MMddyyyy) |
`gender` | StringType | Non-immigrant sex |
`visatype` | StringType | Visa type |
`state` | StringType | Two-character state code |
This data model differs from the raw data in that some columns were dropped, and for the remaining columns the SAS codes were replaced with explicit values and the columns were cast to the correct types. The selected columns give enough information to calculate aggregations per state, visa type, destination in the US, country of residence, and country of citizenship, and to compute statistics such as remaining visa days.
The two datasets are joined on the two-character state code.
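For example, assuming the two processed DataFrames are called `immigration` and `demographics`, the join could be sketched as:

```python
# Join immigration records to state-level demographics on the state code.
joined = immigration.join(
    demographics,
    immigration["state"] == demographics["state_code"],
    how="left",
)
```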
The steps necessary to pipeline the data into the chosen data model are as follows.

For the demographics dataset:

- Group by `City` and `State`
- Count the number of cities
- Pivot the `Race` column
- For each `City` and `Race` pair gather the corresponding `Count`
For the immigration dataset:

- Read data from `sas` format
- Cast the `sas` data to the correct types
- Write data to `parquet` files partitioned by `i94mon` (month) to be easily accessible (sketched below)
- Keep only the columns of interest
- Replace SAS codes with the corresponding information
- Convert the SAS time format to date type
- Convert strings to date type and handle corner cases, e.g. year 9999
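A sketch of the read-and-repartition step, assuming an active `spark` session, the `spark-sas7bdat` package on the classpath, and illustrative input/output paths:

```python
# Read one raw monthly SAS file (one file per month) and rewrite it as parquet
# partitioned by month, so later steps can load a single month cheaply.
raw = (
    spark.read
    .format("com.github.saurfang.sas.spark")
    .load("data/i94/i94_apr16_sub.sas7bdat")
)
raw.write.mode("append").partitionBy("i94mon").parquet("data/i94_parquet/")

# Downstream, a single month can then be read with a partition filter.
april = spark.read.parquet("data/i94_parquet/").where("i94mon = 4")
```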
Build the data pipelines to create the data model.
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
- Integrity constraints on the relational database (e.g., unique key, data type, etc.)
- Unit tests for the scripts to ensure they are doing the right thing
- Source/Count checks to ensure completeness (a sketch follows)
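As an illustration, a source/count completeness check could be sketched as follows (the `raw` and `processed` variable names are placeholders):

```python
from pyspark.sql import functions as F

# The processed table should keep every source row and have no null keys.
assert processed.count() == raw.count(), "row count changed during processing"
assert processed.filter(F.col("cicid").isNull()).count() == 0, "null cicid found"
assert processed.count() > 0, "processed table is empty"
```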
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
For this project I chose to implement the data source as a data lake in AWS S3, and the ETL pipeline in Apache Spark. While the same pipeline could be implemented in a traditional data warehouse, that option is not as flexible: in general one needs to know in advance the kinds of queries that will be run and implement an operational data model based on them, and the resulting architecture might be difficult to adapt to further changes in requirements.
Using Apache Spark and a data lake architecture we gain a lot of flexibility, in the sense that we can work directly on the raw data and benefit from schema-on-read capabilities. This way we can adapt quickly to requirement changes. Another advantage is that the pipeline can easily be shared with a larger audience thanks to the `spark.sql` engine, which allows queries to be written using either the DataFrame API or the SQL API.
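For example, the same aggregation can be written with either API (a sketch, assuming a processed `immigration` DataFrame and an active `spark` session):

```python
# DataFrame API
visits_by_visa = immigration.groupBy("state", "visatype").count()

# SQL API, after registering a temporary view
immigration.createOrReplaceTempView("immigration")
visits_by_visa_sql = spark.sql("""
    SELECT state, visatype, COUNT(*) AS count
    FROM immigration
    GROUP BY state, visatype
""")
```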
Given the current scenario, the demographics data could be updated every year, whereas the immigration data can be updated much more frequently, depending on the use case.
As discussed above, Apache Spark makes it easy for us to adapt to changes in requirements and implement new transformations and queries.
This scenario is naturally handled by Apache Spark: we can scale the Spark cluster both horizontally and vertically to handle the new data requirements, and AWS S3 scales just as well with the increased data volume.
In this case we would need a scheduling system that would run the pipeline every morning and push the results to a location accessible by the dashboard solution. Apache Airflow would be a good fit for this scenario.
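A minimal Airflow sketch of such a schedule (the DAG id, timing, and command are illustrative, not part of this project):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="capstone_etl",
    start_date=datetime(2016, 1, 1),
    schedule_interval="0 7 * * *",  # every morning at 07:00
    catchup=False,
) as dag:
    # Run the Spark ETL; a follow-up task could publish the results to the
    # location read by the dashboard.
    run_etl = BashOperator(task_id="run_etl", bash_command="python main.py")
```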
This depends on the level of data access the users need. If they need access to the upstream (raw) data to perform custom transformations and queries, Apache Spark could be set up on YARN or Mesos clusters so that everyone can access the resources. If the users only need to consume already processed data for summary statistics and visualizations, a main pipeline should be set up and run on a schedule, as discussed above, that outputs an analytical dataset for consumption in a database or some other resource available to everyone.
You need a Python 3 environment with the packages in `requirements.txt` installed. Additionally, Spark must be installed.
The exploratory data analysis steps are in `eda.ipynb`, which you can run using Jupyter Notebook.
The data is available in AWS S3 at `s3://udacity-dataeng-datasets`. You will need the `aws cli` installed. To get the data, run the following commands from this directory:
```sh
mkdir ./data
aws s3 sync s3://udacity-dataeng-datasets ./data/
```
The notebook `queries.ipynb` contains some example queries that can be run on the final dataset.
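For instance, one of the statistics mentioned in the introduction, average visa duration per state and visa type, could be computed like this (a sketch over the processed immigration dataset):

```python
from pyspark.sql import functions as F

# Average length of stay, in days, per destination state and visa type.
(
    immigration
    .withColumn("stay_days", F.datediff("depdate", "arrdate"))
    .groupBy("state", "visatype")
    .agg(F.avg("stay_days").alias("avg_stay_days"))
    .show()
)
```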
To run the ETL pipeline and generate the analytical datasets, run:

```sh
python main.py
```