Gans Data Engineering

Data engineering project demonstrating a fully automated pipeline of web scraping and API calls that feeds a continuously up-to-date database, built with MySQL, Pandas, and Google Cloud.

Project

Gans is a fictional company that rents out electric scooters. To distribute its scooters to customers efficiently, it needs data about the cities of operation, such as population, weather forecasts, and flight arrival times at nearby airports.

The goal of this project is to establish a data engineering pipeline in the cloud (Google Cloud Platform). Information is collected using web scraping and API calls and is continuously updated via cloud scheduling. A set of tables in a relational Cloud SQL instance keeps the data accessible and up to date at all times.

Pipeline

Resources

I have compiled the following resources for this project:

  1. A Python package for implementing the pipeline in this repository

  2. Technical documentation describing the Python package and its setup, both locally and in the cloud

  3. An article on Medium.com about establishing a data engineering project on the Google Cloud Platform

Objectives

  1. Set up a local MySQL database (see the first sketch after this list)
    Tools: MySQL Workbench, Python, SQLAlchemy, mysql-connector-python

  2. Collect static data on cities and airports using web scraping (second sketch below)
    Tools: Python, Pandas, BeautifulSoup

  3. Collect dynamic data on weather and flights using web API calls (third sketch below)
    Tools: Python, Pandas, Requests

  4. Implement the pipeline locally (see the main.py sketch under Repository Structure)
    Tools: Python, functions-framework

  5. Deploy the pipeline on the Google Cloud Platform
    Tools (Google Cloud services): Cloud Functions, Cloud SQL, Cloud Scheduler, Secret Manager

  6. Document the findings by writing an article
    Tools: Medium.com

  7. Write technical documentation of the implementation
    Tools: MkDocs
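
A minimal sketch of objective 1, connecting to a local MySQL instance via SQLAlchemy and mysql-connector-python. The credentials, schema, table, and column names here are placeholders; the actual package reads sensitive values from environment variables:

```python
import pandas as pd
import sqlalchemy

# Placeholder credentials; the real pipeline reads these from environment variables
user = "root"
password = "your_password"
host = "127.0.0.1"
port = 3306
schema = "gans"

engine = sqlalchemy.create_engine(
    f"mysql+mysqlconnector://{user}:{password}@{host}:{port}/{schema}"
)

# Append a DataFrame to a table; table and column names are illustrative
cities = pd.DataFrame({"city": ["Berlin"], "country": ["Germany"]})
cities.to_sql("cities", con=engine, if_exists="append", index=False)
```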
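
For objective 2, a sketch of scraping a city's population with BeautifulSoup. It assumes the Wikipedia infobox contains a cell labelled exactly "Population", which varies between pages; the real scraper in cities.py may use different selectors:

```python
import requests
from bs4 import BeautifulSoup

city = "Berlin"
response = requests.get(f"https://en.wikipedia.org/wiki/{city}")
soup = BeautifulSoup(response.content, "html.parser")

# Assumes the infobox contains a label that reads exactly "Population";
# real pages differ, so a production scraper needs more robust selectors
population = soup.find(string="Population").find_next("td").get_text(strip=True)
print(city, population)
```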
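
For objective 3, a sketch of a forecast request with Requests. OpenWeather's 5-day/3-hour forecast endpoint is assumed here purely for illustration, and the API key is a placeholder:

```python
import requests

api_key = "your_openweather_api_key"  # placeholder
city = "Berlin"
url = (
    "https://api.openweathermap.org/data/2.5/forecast"
    f"?q={city}&appid={api_key}&units=metric"
)
forecast = requests.get(url).json()

# Flatten the 3-hourly forecast items into rows for a weather table
rows = [
    {
        "city": city,
        "forecast_time": item["dt_txt"],
        "temperature": item["main"]["temp"],
        "outlook": item["weather"][0]["description"],
    }
    for item in forecast["list"]
]
```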

Languages, Libraries, and Tools Used

  • MySQL Workbench
  • Pandas
  • BeautifulSoup
  • Requests
  • RapidAPI
  • Google Cloud Services
  • MkDocs
  • GitHub Pages
  • Medium.com

Database Schema

Running the Python pipeline for the first time automatically creates the complete SQL schema shown below.

(Database schema diagram)
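
A sketch of how the automatic schema creation could work, assuming the Database class simply executes the statements in pipeline/create_database.sql one by one; the actual implementation may differ:

```python
import sqlalchemy

engine = sqlalchemy.create_engine(
    "mysql+mysqlconnector://user:password@127.0.0.1:3306"  # placeholder credentials
)

# Naive splitting on ';' is a simplification that works for plain DDL scripts
with engine.connect() as connection:
    with open("pipeline/create_database.sql") as script:
        for statement in script.read().split(";"):
            if statement.strip():
                connection.execute(sqlalchemy.text(statement))
```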

Repository Structure

├── pipeline                <- Source code of the Python package
│   │
│   ├── database.py         <- Database class as user interface for all operations
│   │
│   ├── create_database.sql <- SQL script for creating the database structure
│   │
│   ├── cities.py           <- Internal functions for collecting static data using web scraping
│   ├── airports.py
│   │
│   ├── weather.py          <- Internal functions for collecting dynamic data using APIs
│   └── flights.py
│   
├── docs                    <- MkDocs documentation of the Python package 'pipeline'
│
├── requirements.txt        <- Dependencies for reproducing the pipeline environment
│
├── example.env             <- Environment variables for sensitive data
│
└── main.py                 <- Google Cloud Functions script
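
main.py exposes the pipeline as an HTTP-triggered function via functions-framework, so the same entry point runs locally and as a Google Cloud Function invoked by Cloud Scheduler. A hedged sketch; the Database method names are assumptions based on the class's role as the interface for all operations:

```python
import functions_framework

from pipeline.database import Database


@functions_framework.http
def update_database(request):
    """Entry point triggered by Cloud Scheduler via HTTP."""
    db = Database()          # hypothetical; constructor arguments omitted
    db.update_weather()      # assumed method names
    db.update_flights()
    return "Database updated", 200
```

Locally, `functions-framework --target update_database` serves the function on port 8080 for testing before deployment.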