/ETL-Project

In this project, we built a database that demonstrates the changes in American top fastest-growing private companies through time. The database is built on by ingesting, combining, and restructuring data from three main data sources into a conformed one Postgresql database, and deploy into the Flask app.

Primary LanguageHTML

ETL-Project

Unveil the Top Fastest Growing Private Companies in America for the Last Thirteen Years (2007 - 2020)

ETL_project

Introduction

This project is designed to conduct a presentation of business information or Business Intelligence by extracting, transforming, and loading the top fastest-growing private companies in America for the last thirteen years (2007-2020).

The purpose of this project was to build a database that demonstrates the changes in American top fastest-growing private companies through time. The database is built on by ingesting, combining, and restructuring data from three main data sources into a conformed one Postgresql database, and deploy in to Flask app. The three sources of our data are Inc 5000, Financial times ranking 500 2020, and Growjo Company API for scraping fastest-growing companies in 2020.

Data Extraction

In this project we extracted, transformed, and loaded thirteen years (2007-2020) American top fastest-growing private companies.

Our main sources:

Data Enginering

After extracting the data we conducted a data engineering, and Entity-Relationship Diagram (ERD) by using an open-source toolkit called Quickdatabasediagrams. The model looks as follows: ETL_project

Data Transformation

  • We used a Pandas functions in Jupyter Notebook to transform all CSV files, scraped data, and API request responses.
  • We reviewed the files and transformed into a dataframes.
  • We used a python transformation functions for data cleaning, joining, filtering, and aggregating.
  • Several columns removed
  • Duplicate rows was removed, and successfully managed.
  • We conducted some aggregation to find totals for comparison in the datasets.

Load

  • For our final production, we used a relational database called Postgresql, in a total of three tables, twelve columns created, and the data loaded successfully. A flask app is also created for any one to get access the data.
  • Final tables/collections are stored in the production database

Company Table

id company_name number_of_employees industry city state country
0 20391 SwanLeap 49.0 Logistics & Transportation Madison WI United States
1 16357 PopSockets 118.0 Consumer Products & Services Boulder CO United States
2 9922 Home Chef 865.0 Food & Beverage Chicago IL United States
3 22829 Velocity Global 55.0 Business Products & Services Denver CO United States
4 5896 DEPCOM Power 104.0 Energy Scottsdale AZ United States

Ranks Table

id rank rank_year
0 20391 1 2018
1 16357 2 2018
2 9922 3 2018
3 22829 4 2018
4 5896 5 2018

Aggregate

Total number of companies

Total Companies
0 24115

Aggrigate plot on High growth American private company enteries

aggrigate_image

Deploy in to flask app

We used Postgresql with Flask templating to create a new HTML page that displays information about our project work.

  • We created a root route / which serve as a home page
  • We created a route called /companies that will displaye the json file for companies list.

Finally, we created a template HTML file called index.html that take companies information, and displayed them.

Flask_app

Team members

  • Adedamola Atekoja (‘Damola)
  • Amanda Qianyue Ma
  • Amos Johnson
  • Ermias Gaga
  • Maria Lorena