Welcome to the ETL Movies Data Project! This project is a deep dive into building an end-to-end ETL (Extract, Transform, Load) pipeline using Python, Docker, and a Google Cloud Pub/Sub emulator. We're working with a dataset of movie ratings, transforming it into a format that's ready for analysis, and loading it into a Docker container for easy access and management. This project is perfect for those looking to simulate a real-world ETL process in a local environment.
Here's how our project is organized:

- data/ # Where all the magic data lives!
  - ratings.csv # Contains rated movie data
  - movies.csv # Contains information about movies
  - full_data.csv # The preprocessed movie data file
- Dockerfile # Our recipe for the Docker environment
- requirements.txt # All the ingredients (dependencies)
- README.md # This very guide you're reading!
- Scripts/ # Scripts to automate various tasks
  - setup_env.bat # Environment setup script
  - download_data.bat # Data download script
  - start_emulator.py # Start the Pub/Sub emulator
  - create_topic_subscription.py # Create Pub/Sub topic and subscription
  - publish_test_message.py # Test data ingestion
  - process_data.py # Extract CSV files
  - preprocessing_data.py # Clean up the data
  - publish_data.py # Publish data to the container
Feel free to explore each part of the project to understand its role and how everything fits together. Happy coding!
- Python: The core language for our scripts.
- Docker: To containerize and run everything smoothly.
- Google Cloud Pub/Sub: For simulating real-time data streaming.
- Pandas: Our go-to for data manipulation.
- Windows Batch Scripting: Automating the setup and downloads.
- Google Cloud SDK: For interacting with Google Cloud services.
Ready to get started? Here's what you need to do:

- Install Required Tools:
  - Docker Desktop
  - Python 3.12.5
  - Google Cloud SDK
- Clone the Repository:
  - Use Git to clone the repository and move into the project directory.
- Install Python Dependencies:
  - Create a virtual environment and install the dependencies, e.g. `python -m venv venv`, then `venv\Scripts\activate`, then `pip install -r requirements.txt`.
Here's a step-by-step guide to get you through the project:

Goal: Set up the project environment with all the necessary dependencies.
How: Run the `setup_env.bat` script, and let the automation magic happen!
Goal: Get the movie ratings data.
How: Simply run `download_data.bat`, and the data will be at your service!
Goal: Simulate the Pub/Sub environment locally.
How: Kickstart the emulator with `start_emulator.py`.
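Under the hood, starting the emulator usually means shelling out to the Google Cloud SDK. Here's a minimal sketch of what a script like `start_emulator.py` could do; the host/port and project ID are example values, not taken from this repo:

```python
import subprocess

# Launch the Pub/Sub emulator via the Google Cloud SDK.
# Host/port and project ID are example values, not this repo's config.
EMULATOR_HOST = "localhost:8085"
PROJECT_ID = "local-project"

# On Windows you may need shell=True or "gcloud.cmd" instead of "gcloud".
process = subprocess.Popen(
    [
        "gcloud", "beta", "emulators", "pubsub", "start",
        f"--host-port={EMULATOR_HOST}",
        f"--project={PROJECT_ID}",
    ]
)

print(f"Pub/Sub emulator starting on {EMULATOR_HOST} (Ctrl+C to stop)")
process.wait()
```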
Goal: Package everything into a Docker image.
How: Build the image using Docker, and watch it come to life!
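For example, from the project root (the image tag is just an illustration, not a name fixed by this repo): `docker build -t etl-movies .`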
Goal: Spin up the Docker container.
How: Use the Docker run command, and let the container do its thing.
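For example (the container and image names are illustrative): `docker run -it --name etl-movies-container etl-movies`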
Goal: Create a Pub/Sub topic and subscription.
How: Execute `create_topic_subscription.py`, and set the stage for data flow.
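For reference, here's a minimal sketch of this step using the `google-cloud-pubsub` client library pointed at the emulator; the project, topic, and subscription names are illustrative assumptions:

```python
import os

from google.cloud import pubsub_v1

# Point the client libraries at the local emulator instead of real GCP.
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

PROJECT_ID = "local-project"    # example value
TOPIC_ID = "movies-topic"       # example value
SUBSCRIPTION_ID = "movies-sub"  # example value

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

publisher.create_topic(request={"name": topic_path})
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)
print(f"Created {topic_path} and {subscription_path}")
```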
Goal: Ensure data ingestion works smoothly.
How: Run `publish_test_message.py` and see the messages flow!
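A minimal sketch of such a test, reusing the illustrative emulator settings and topic name from the previous step:

```python
import os

from google.cloud import pubsub_v1

os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

publisher = pubsub_v1.PublisherClient()
# Project and topic names are illustrative, matching the earlier sketch.
topic_path = publisher.topic_path("local-project", "movies-topic")

# publish() returns a future; result() blocks until the emulator acks it.
future = publisher.publish(topic_path, b"hello from the ETL pipeline")
print(f"Published test message with ID: {future.result()}")
```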
Goal: Extract and prepare the data.
How: Run `process_data.py` and get your CSVs ready for action!
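At its core, the extract step amounts to loading the raw CSVs with pandas; a rough sketch (file paths follow the project layout above):

```python
import pandas as pd

# Load the raw CSV files produced by the download step.
ratings = pd.read_csv("data/ratings.csv")
movies = pd.read_csv("data/movies.csv")

# Quick sanity check on what was extracted.
print(ratings.shape, movies.shape)
print(ratings.head())
```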
Goal: Clean and prep the data.
How: Execute `preprocessing_data.py`, and your data will be spotless! Inside this script (see the sketch below):
- null values are filled
- datatypes are corrected
- the datasets are merged on item_id
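Here's a rough pandas sketch of those three steps; the fill values and dtypes are assumptions, but the merge key `item_id` comes from the project itself:

```python
import pandas as pd

ratings = pd.read_csv("data/ratings.csv")
movies = pd.read_csv("data/movies.csv")

# Fill null values (the fill strategy here is an example, not the
# repo's exact choice).
ratings = ratings.fillna(0)
movies = movies.fillna("unknown")

# Correct datatypes, e.g. make sure the merge key is an integer.
ratings["item_id"] = ratings["item_id"].astype(int)
movies["item_id"] = movies["item_id"].astype(int)

# Merge the two datasets on item_id and save the combined file.
full_data = ratings.merge(movies, on="item_id")
full_data.to_csv("data/full_data.csv", index=False)
```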
Goal: Create a place in the container for our data.
How: Access the Docker container's terminal and create the `/data` folder.
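For example, from the host (the container name is illustrative): `docker exec -it etl-movies-container mkdir -p /data`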
Goal: Send the data into the Docker container.
How: Run `publish_data.py` and watch the data transfer!
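A minimal sketch of the publishing loop, again reusing the illustrative emulator settings and topic name from earlier:

```python
import json
import os

import pandas as pd
from google.cloud import pubsub_v1

os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

publisher = pubsub_v1.PublisherClient()
# Project and topic names are illustrative, matching the earlier sketches.
topic_path = publisher.topic_path("local-project", "movies-topic")

df = pd.read_csv("data/full_data.csv")

# Publish each row as a JSON-encoded Pub/Sub message.
for record in df.to_dict(orient="records"):
    publisher.publish(topic_path, json.dumps(record).encode("utf-8"))

print(f"Published {len(df)} records to {topic_path}")
```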
Google Cloud Pub/Sub is all about handling data in real time, and this project lets us simulate how large-scale data processing would work in the cloud. The Pub/Sub emulator lets us develop and test everything locally, so we're ready for the real cloud when the time comes.
- Handling Big Data: Processing 100,000 rows was a challenge, but we conquered it!
- Local Cloud Simulation: Setting up the Pub/Sub emulator wasn't easy, but it was worth it.
- Data Cleaning: Ensuring clean and reliable data required some serious attention to detail.
This project fully demonstrates how to build a robust ETL pipeline, complete with Dockerization and cloud simulation. Whether you're here to learn or to build, this project has all the tools and guidance you need. Happy coding!
Find everything you need in our GitHub repository. Dive in, explore, and feel free to contribute!
Contributions to this project are welcome! If you'd like to contribute or have any questions, please contact:
- Author: Nada Hamdy Fatehy
- Email: nadahamdy2172002@gmail.com
- LinkedIn: LinkedIn
- GitHub: GitHub