ETL-Project

U.S. Quality of life by State

ETL

Team Members

  • Harry Feldman
  • Jessica Pardo
  • Andrey Tokarev
  • Raven Washington

Introduction

This project is designed to conduct an ETL process by extracting, transforming, and loading data that pertains to the quality of life in different States of the USA. The purpose of extracting the data will be to create a database for potential future country-wide analysis of housing, healthcare, and other measure of quality of life.

Data Extraction

In this project, CSV datasets were extrated from the following sources:

To complete the process of data extraction and transformation, it was prepared a python file ETL Notebook.

Data Enginering

After extracting the data, we made an Entity-Relationship Diagram (ERD) by using an open-source toolkit called Quick Database Diagrams. The model looks as follows:

ERD

Data Transformation

The transformation of the data include the following workflow:

  • Pandas functions in Jupyter Notebook to transform all CSV files responses.
  • CSV files transformed into a dataframes.
  • Python transformation functions for data cleaning, joining, filtering, and null values removed.
  • Several columns removed
  • Duplicate rows was removed, and successfully managed.

More detail of the transformation of the data can be seen in ETL Notebook.

Data Loading

After the process of extracting and transforming the data, we created a SQL database to load the dababase. First, we made a table schema SQL Table Schema for each of the CSV files saved in the Resources directory.

Schema

Using Python and SQLAlchemy, we loaded our data into the tables into PostgreSQL for population.

Loading