Data Analytics Engineering Project: Stadium Data ETL using Docker, Airflow, AWS, BeautifulSoup, Snowflake, and Tableau
This project involves setting up a data pipeline for scraping, transforming, and loading stadium data using Docker, Airflow, AWS (S3), BeautifulSoup, Snowflake, and Tableau with Python and SQL.
The main objective of this project is to gather, transform, and load stadium data, leveraging these tools to manage the end-to-end workflow. This project showcases my expertise in data engineering and my ability to handle complex data workflows.
Data Source Link: https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity
Video Link: https://youtu.be/lFwdFiiomzU
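As an illustration of the scraping step, here is a minimal sketch of pulling the capacity table from the Wikipedia page with requests and BeautifulSoup. The function name, headers handling, and parsing details are assumptions for illustration; the actual DAG code in the repository may differ.

```python
# Minimal scraping sketch -- not the project's DAG code, just the core idea.
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity"

def scrape_stadiums(url: str = URL) -> list[dict]:
    """Parse the first wikitable on the page into a list of row dicts."""
    html = requests.get(url, headers={"User-Agent": "stadium-etl/0.1"}).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})

    headers = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        if len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
    return rows

if __name__ == "__main__":
    data = scrape_stadiums()
    print(f"Scraped {len(data)} rows; first row: {data[0] if data else None}")
```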
Docker Compose brings up the following containers:
- Airflow webserver
- Airflow scheduler
- Postgres image container
Airflow DAGs:
- File: `stadium_schema_creation_dag.py` — runs schema creation for the staging tables.
- File: `wikipedia_stadium_data_pipeline_dag.py` — scrapes the data, performs the ETL, and inserts it into the database.
Pipeline tasks (a DAG sketch follows this list):
- Run the web scraping.
- Insert the scraped data into the S3 bucket.
- Store and manage the loaded data in Snowflake.
- Create visualizations and dashboards in Tableau.
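To make the task ordering concrete, the following is a hedged skeleton of what `wikipedia_stadium_data_pipeline_dag.py` could look like. The task IDs, callables, and schedule are illustrative assumptions, not the repository's exact code.

```python
# Illustrative DAG skeleton (scrape -> S3 -> Snowflake); the real tasks live in
# wikipedia_stadium_data_pipeline_dag.py and may be structured differently.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_wikipedia_stadiums(**context):
    """Scrape the stadium table from Wikipedia (see the BeautifulSoup sketch above)."""
    ...


def upload_to_s3(**context):
    """Write the transformed data to the S3 bucket."""
    ...


def load_into_snowflake(**context):
    """Copy the staged file from S3 into the Snowflake staging table."""
    ...


with DAG(
    dag_id="wikipedia_stadium_data_pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually from the Airflow UI; Airflow < 2.4 uses schedule_interval
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape_stadium_data", python_callable=scrape_wikipedia_stadiums)
    to_s3 = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)
    to_snowflake = PythonOperator(task_id="load_into_snowflake", python_callable=load_into_snowflake)

    scrape >> to_s3 >> to_snowflake
```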
Snowflake Airflow connection (a usage sketch follows this list):
- Connection ID: `snowflake_conn_stadium`
- Type: Snowflake
- Description: Snowflake Airflow Connection - Stadium
- Schema: `STADIUM_SCHEMA`
- Login: `*********`
- Password: (set your password)
- Account: `****.us-east-2.aws`
- Warehouse: `STADIUM_WAREHOUSE`
- Database: `STADIUM_DB`
- Role: `ACCOUNTADMIN`
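As a hedged example of how a DAG can use this connection, the snippet below runs staging-table DDL through Airflow's Snowflake provider with the `snowflake_conn_stadium` connection ID. The table name and columns are assumptions for illustration; the actual DDL lives in `stadium_schema_creation_dag.py`.

```python
# Illustrative schema-creation task; requires the apache-airflow-providers-snowflake
# package. Table name and columns below are assumed, not the project's exact schema.
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

create_staging_table = SnowflakeOperator(
    task_id="create_stadium_staging_table",
    snowflake_conn_id="snowflake_conn_stadium",  # the connection configured above
    sql=[
        "CREATE SCHEMA IF NOT EXISTS STADIUM_DB.STADIUM_SCHEMA",
        """
        CREATE TABLE IF NOT EXISTS STADIUM_DB.STADIUM_SCHEMA.STG_STADIUMS (
            stadium  VARCHAR,
            capacity NUMBER,
            region   VARCHAR,
            country  VARCHAR,
            city     VARCHAR
        )
        """,
    ],
)  # place this inside the `with DAG(...)` block of the schema-creation DAG
```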
- Keep the S3 policy and the staging policy separate. Use `stadiums` in S3 and `stadium` in code to avoid confusion (see the upload sketch after this list).
- There were issues with the DBT image container.
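For illustration only, here is a small boto3 sketch of the S3 upload step under that naming convention. The bucket name, key prefix, and file name are assumptions, not the repository's actual values.

```python
# Illustrative S3 upload: the bucket uses the plural name "stadiums",
# while the key prefix and variables in code stick to the singular "stadium".
import boto3

S3_BUCKET = "stadiums"                    # plural on the S3 side (assumed bucket name)
S3_KEY = "stadium/stadium_capacity.csv"   # singular prefix in code (assumed key)

def upload_stadium_file(local_path: str) -> None:
    """Upload the transformed stadium CSV to the staging bucket."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, S3_BUCKET, S3_KEY)

# Example: upload_stadium_file("/tmp/stadium_capacity.csv")
```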
Prerequisites:
- Docker
- Python 3.x
- pip (Python package installer)
- AWS account
- Snowflake account
- Tableau Desktop account
Setup:

- Clone the repository:

  ```bash
  git clone https://github.com/BadreeshShetty/DE-Stadiums
  cd DE-Stadiums
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up Docker: ensure Docker is installed and running on your machine.

- Configure AWS credentials: update the AWS credentials in the Docker Compose YAML file.

- Run Docker Compose:

  ```bash
  docker-compose up
  ```

- Run Airflow: set up and start the Airflow scheduler and webserver.

  ```bash
  airflow db init
  airflow scheduler
  airflow webserver
  ```

- Configure the Snowflake connection: update the Snowflake credentials in Airflow (see the connection details above).

- Run the DAGs: trigger the DAGs from the Airflow web interface.
Detailed documentation of the project can be found in the article Stadium Data Analytics Engineering Project (Docker, Airflow, AWS, Web Scraping, Python, SQL, Snowflake, Tableau).
For any questions or suggestions, feel free to reach out to me at badreeshshetty@gmail.com.