Gans Data Engineering

Data engineering project demonstrating a fully automated pipeline of web scraping and API calls that feeds a continuously up-to-date database, built with MySQL, Pandas, and Google Cloud.

Project

Gans is a fictional company that rents out electric scooters. To distribute its scooters to customers efficiently, it needs data about the cities of operation, such as population, weather forecasts, and flight arrival times at nearby airports.

The goal of this project is to establish a data engineering pipeline in the cloud (Google Cloud Platform). Information is collected using web scraping and API calls and is continuously updated via cloud scheduling. A set of tables in a relational Cloud SQL instance keeps the data accessible and up to date at all times.

Pipeline

Resources

I have compiled the following resources for this project:

  1. A Python package for implementing the pipeline in this repository

  2. Technical documentation describing the Python package and its setup, both locally and in the cloud

  3. An article on Medium.com about establishing a data engineering project on the Google Cloud Platform

Objectives

  1. Set up a local MySQL database (see the first sketch after this list)
    Tools: MySQL Workbench, Python, SQLAlchemy, mysql-connector-python

  2. Collect static data on cities and airports using web scraping (second sketch below)
    Tools: Python, Pandas, BeautifulSoup

  3. Collect dynamic data on weather and flights using web API calls (third sketch below)
    Tools: Python, Pandas, Requests

  4. Implement the pipeline locally (see the main.py sketch under Repository Structure)
    Tools: Python, functions-framework

  5. Deploy the pipeline on the Google Cloud Platform
    Tools (Google Cloud services): Cloud Functions, Cloud SQL, Cloud Scheduler, Secret Manager

  6. Document the findings by writing an article
    Tools: Medium.com

  7. Write technical documentation of the implementation
    Tools: MkDocs
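
A minimal sketch of objective 1, connecting to a local MySQL instance via SQLAlchemy and mysql-connector-python. The credentials, schema, table, and column names here are placeholders; the actual package reads sensitive values from environment variables:

```python
import pandas as pd
import sqlalchemy

# Placeholder credentials; the real pipeline reads these from environment variables
user = "root"
password = "your_password"
host = "127.0.0.1"
port = 3306
schema = "gans"

engine = sqlalchemy.create_engine(
    f"mysql+mysqlconnector://{user}:{password}@{host}:{port}/{schema}"
)

# Append a DataFrame to a table; table and column names are illustrative
cities = pd.DataFrame({"city": ["Berlin"], "country": ["Germany"]})
cities.to_sql("cities", con=engine, if_exists="append", index=False)
```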
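
For objective 2, a sketch of scraping a city's population with BeautifulSoup. It assumes the Wikipedia infobox contains a cell labelled exactly "Population", which varies between pages; the real scraper in cities.py may use different selectors:

```python
import requests
from bs4 import BeautifulSoup

city = "Berlin"
response = requests.get(f"https://en.wikipedia.org/wiki/{city}")
soup = BeautifulSoup(response.content, "html.parser")

# Assumes the infobox contains a label that reads exactly "Population";
# real pages differ, so a production scraper needs more robust selectors
population = soup.find(string="Population").find_next("td").get_text(strip=True)
print(city, population)
```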
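
For objective 3, a sketch of a forecast request with Requests. OpenWeather's 5-day/3-hour forecast endpoint is assumed here purely for illustration, and the API key is a placeholder:

```python
import requests

api_key = "your_openweather_api_key"  # placeholder
city = "Berlin"
url = (
    "https://api.openweathermap.org/data/2.5/forecast"
    f"?q={city}&appid={api_key}&units=metric"
)
forecast = requests.get(url).json()

# Flatten the 3-hourly forecast items into rows for a weather table
rows = [
    {
        "city": city,
        "forecast_time": item["dt_txt"],
        "temperature": item["main"]["temp"],
        "outlook": item["weather"][0]["description"],
    }
    for item in forecast["list"]
]
```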

Languages, Libraries, and Tools Used

  • MySQL Workbench
  • Pandas
  • BeautifulSoup
  • Requests
  • RapidAPI
  • Google Cloud Services
  • MkDocs
  • GitHub Pages
  • Medium.com

Database Schema

Running the Python pipeline for the first time automatically creates the complete SQL schema shown below.

(Database schema diagram)
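
A sketch of how the automatic schema creation could work, assuming the Database class simply executes the statements in pipeline/create_database.sql one by one; the actual implementation may differ:

```python
import sqlalchemy

engine = sqlalchemy.create_engine(
    "mysql+mysqlconnector://user:password@127.0.0.1:3306"  # placeholder credentials
)

# Naive splitting on ';' is a simplification that works for plain DDL scripts
with engine.connect() as connection:
    with open("pipeline/create_database.sql") as script:
        for statement in script.read().split(";"):
            if statement.strip():
                connection.execute(sqlalchemy.text(statement))
```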

Repository Structure

├── pipeline                <- Source code of the Python package
│   │
│   ├── database.py         <- Database class as user interface for all operations
│   │
│   ├── create_database.sql <- SQL script for creating the database structure
│   │
│   ├── cities.py           <- Internal functions for collecting static data using web scraping
│   ├── airports.py
│   │
│   ├── weather.py          <- Internal functions for collecting dynamic data using APIs
│   └── flights.py
│   
├── docs                    <- MkDocs documentation of the Python package 'pipeline'
│
├── requirements.txt        <- Dependencies for reproducing the pipeline environment
│
├── example.env             <- Environment variables for sensitive data
│
└── main.py                 <- Google Cloud Functions script
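
main.py exposes the pipeline as an HTTP-triggered function via functions-framework, so the same entry point runs locally and as a Google Cloud Function invoked by Cloud Scheduler. A hedged sketch; the Database method names are assumptions based on the class's role as the interface for all operations:

```python
import functions_framework

from pipeline.database import Database


@functions_framework.http
def update_database(request):
    """Entry point triggered by Cloud Scheduler via HTTP."""
    db = Database()          # hypothetical; constructor arguments omitted
    db.update_weather()      # assumed method names
    db.update_flights()
    return "Database updated", 200
```

Locally, `functions-framework --target update_database` serves the function on port 8080 for testing before deployment.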