/Movies-Data-ETL-using-Python-GCP

An ETL pipeline for movie data built with Python, Docker, and a GCP Pub/Sub emulator. The data is processed and published entirely in a local Docker environment.

Primary Language: Python

🎥 ETL Movies Data Project

🚀 Project Overview

Welcome to the ETL Movies Data Project! 🌟 This project is a deep dive into building an end-to-end ETL (Extract, Transform, Load) pipeline using Python, Docker, and a Google Cloud Pub/Sub emulator. We’re working with a dataset of movie ratings, transforming it into a format that's ready for analysis, and loading it into a Docker container for easy access and management. This project is perfect for those looking to simulate a real-world ETL process in a local environment. 💻


🗂 Project Structure

Here’s how our project is organized:

ETL_MOVIES/

  • data/ # Where all the magic data lives! 🎩

    • ratings.csv # Contains rated movie data 📊
    • movies.csv # Contains information about movies 🎥
    • full_data.csv # The preprocessed movie data file 🗃
  • Dockerfile # Our recipe for the Docker environment 📦

  • requirements.txt # All the ingredients (dependencies) 🛠

  • README.md # This very guide you’re reading! 📚

  • Scripts/ # Scripts to automate various tasks 🎛

    • setup_env.bat # Environment setup script ⚙️
    • download_data.bat # Data download script ⬇️
    • start_emulator.py # Start the Pub/Sub emulator 🚀
    • create_topic_subscription.py # Create Pub/Sub topic and subscription 📝
    • publish_test_message.py # Test data ingestion 🧪
    • process_data.py # Extract CSV files 📂
    • preprocessing_data.py # Clean up the data 🧼
    • publish_data.py # Publish data to the container 🚚

Feel free to explore each part of the project to understand its role and how everything fits together. Happy coding! 👩‍💻👨‍💻

🔧 Tools and Technologies

  • Python 🐍: The core language for our scripts.
  • Docker 🐳: To containerize and run everything smoothly.
  • Google Cloud Pub/Sub ☁️: For simulating real-time data streaming.
  • Pandas 🐼: Our go-to for data manipulation.
  • Windows Batch Scripting 🖥: Automating the setup and downloads.
  • Google Cloud SDK 🌐: For interacting with Google Cloud services.

📋 Environment Setup

Ready to get started? Here’s what you need to do:

  1. Install Required Tools:

    • Docker Desktop
    • Python 3.12.5
    • Google Cloud SDK
  2. Clone the Repository:

    • Use Git to clone the repository and dive into the project directory.
  3. Install Python Dependencies:

    • Create a virtual environment and install the dependencies listed in requirements.txt.

📚 Project Steps

Here’s a step-by-step guide to get you through the project:

1️⃣ Set Up Environment

Goal: Set up the project environment with all the necessary dependencies.

How: Run the setup_env.bat script, and let the automation magic happen! ✨


2️⃣ Download Data

Goal: Get the movie ratings data.

How: Simply run download_data.bat, and the data will be at your service! 📥


3️⃣ Set Up Pub/Sub Emulator

Goal: Simulate the Pub/Sub environment locally.

How: Kickstart the emulator with start_emulator.py. 🚀
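
To give a feel for what a script like start_emulator.py might do, here is a minimal sketch rather than the project's actual code. It assumes the gcloud CLI with the Pub/Sub emulator component is installed; the project ID `etl-movies-local` and the address `localhost:8085` are hypothetical placeholders.

```python
import os
import shutil
import subprocess

# Hypothetical defaults for illustration; the real script may use other values.
PROJECT_ID = "etl-movies-local"
HOST_PORT = "localhost:8085"


def build_emulator_command(project_id: str, host_port: str) -> list[str]:
    """Return the gcloud command line that starts the Pub/Sub emulator."""
    # shutil.which also resolves gcloud.cmd on Windows; fall back to the bare name.
    gcloud = shutil.which("gcloud") or "gcloud"
    return [
        gcloud, "beta", "emulators", "pubsub", "start",
        f"--project={project_id}",
        f"--host-port={host_port}",
    ]


def start_emulator(project_id: str = PROJECT_ID,
                   host_port: str = HOST_PORT) -> subprocess.Popen:
    """Start the emulator as a child process and return its handle."""
    # Client libraries talk to the emulator when PUBSUB_EMULATOR_HOST is set.
    # This only affects the current process and its children.
    os.environ["PUBSUB_EMULATOR_HOST"] = host_port
    return subprocess.Popen(build_emulator_command(project_id, host_port))


if __name__ == "__main__":
    process = start_emulator()
    process.wait()  # Keep running until the emulator is stopped (Ctrl+C).
```

Any other terminal that runs the later scripts would need PUBSUB_EMULATOR_HOST set to the same address, otherwise the client library tries to reach the real Pub/Sub service.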


4️⃣ Build Docker Image

Goal: Package everything into a Docker image.

How: Build the image using Docker, and watch it come to life! 🛠


5️⃣ Run Docker Container

Goal: Spin up the Docker container.

How: Use the docker run command, and let the container do its thing. 🏃‍♂️


6️⃣ Create Pub/Sub Topic and Subscription

Goal: Create a Pub/Sub topic and subscription.

How: Execute create_topic_subscription.py, and set the stage for data flow. 🌐
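
As a rough illustration of this step, the sketch below creates a topic and a pull subscription with the google-cloud-pubsub client library. It is an assumption about how create_topic_subscription.py could look, not its actual contents; the project, topic, and subscription names are hypothetical, and `require_emulator` is a helper invented here to avoid touching the real cloud by accident.

```python
import os

# Hypothetical identifiers for illustration only.
PROJECT_ID = "etl-movies-local"
TOPIC_ID = "movies-topic"
SUBSCRIPTION_ID = "movies-subscription"


def require_emulator() -> str:
    """Return the emulator address, or fail rather than contact real Pub/Sub."""
    host = os.environ.get("PUBSUB_EMULATOR_HOST")
    if not host:
        raise RuntimeError("PUBSUB_EMULATOR_HOST is not set; start the emulator first.")
    return host


def create_topic_and_subscription(project_id, topic_id, subscription_id):
    """Create a topic and a pull subscription, returning their resource paths."""
    # Imported here so the module loads even where the client library is absent.
    from google.api_core.exceptions import AlreadyExists
    from google.cloud import pubsub_v1

    require_emulator()
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    try:
        publisher.create_topic(request={"name": topic_path})
    except AlreadyExists:
        pass  # Re-running the script should not fail.

    try:
        subscriber.create_subscription(
            request={"name": subscription_path, "topic": topic_path}
        )
    except AlreadyExists:
        pass

    return topic_path, subscription_path


if __name__ == "__main__":
    print(create_topic_and_subscription(PROJECT_ID, TOPIC_ID, SUBSCRIPTION_ID))
```

Catching AlreadyExists keeps the script safe to run more than once against the same emulator session.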


7️⃣ Test Data Ingestion

Goal: Ensure data ingestion works smoothly.

How: Run publish_test_message.py and see the messages flow! 🎯
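
A smoke test along these lines could publish one message and pull it back. This is a hedged sketch, not the real publish_test_message.py: the identifiers are hypothetical, it assumes the topic and subscription already exist, and it assumes PUBSUB_EMULATOR_HOST points at the running emulator.

```python
# Hypothetical identifiers for illustration only.
PROJECT_ID = "etl-movies-local"
TOPIC_ID = "movies-topic"
SUBSCRIPTION_ID = "movies-subscription"


def encode_test_message(text: str) -> bytes:
    """Pub/Sub payloads are bytes, so encode the text as UTF-8."""
    return text.encode("utf-8")


def publish_and_pull(project_id, topic_id, subscription_id,
                     text="hello from the ETL pipeline"):
    """Publish one message, pull at most one back, and acknowledge it."""
    # Imported here so the module loads even where the client library is absent.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    # publish() returns a future; result() blocks until the message ID is known.
    future = publisher.publish(topic_path, data=encode_test_message(text))
    message_id = future.result(timeout=10)

    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 1},
        timeout=10,
    )
    received = [m.message.data.decode("utf-8") for m in response.received_messages]
    ack_ids = [m.ack_id for m in response.received_messages]
    if ack_ids:
        # Acknowledge so the message is not redelivered later.
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
    return message_id, received


if __name__ == "__main__":
    print(publish_and_pull(PROJECT_ID, TOPIC_ID, SUBSCRIPTION_ID))
```

A single pull can come back empty even when a message is on its way, so an empty list here is a reason to retry rather than proof that ingestion failed.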


8️⃣ Extract CSV Files

Goal: Extract and prepare the data.

How: Run process_data.py and get your CSVs ready for action! 📑
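
The README does not say what form the downloaded data takes. Assuming it arrives as a zip archive (an assumption, as is the archive name `data/movies.zip`), the extraction step could be sketched like this:

```python
import zipfile
from pathlib import Path


def extract_csv_files(archive_path: Path, target_dir: Path) -> list[Path]:
    """Extract only the .csv members of a zip archive into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.infolist():
            if member.is_dir() or not member.filename.lower().endswith(".csv"):
                continue
            # Flatten nested folders so every CSV lands directly in target_dir.
            destination = target_dir / Path(member.filename).name
            destination.write_bytes(archive.read(member))
            extracted.append(destination)
    return extracted


if __name__ == "__main__":
    # "data/movies.zip" is a hypothetical archive name.
    for path in extract_csv_files(Path("data/movies.zip"), Path("data")):
        print(f"extracted {path}")
```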


9️⃣ Preprocess Data

Goal: Clean and prep the data.

How: Execute preprocessing_data.py, and your data will be spotless! 🧼

  • Inside this Python file:
    • null values are filled
    • data types are corrected
    • the datasets are merged on item_id
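
The three operations above can be sketched with pandas as follows. This is an illustrative sketch, not the contents of preprocessing_data.py: apart from item_id, the column names (`rating`, `title`) and the fill values are assumptions.

```python
import pandas as pd


def preprocess(ratings: pd.DataFrame, movies: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls, fix dtypes, and merge ratings with movie details on item_id."""
    ratings = ratings.copy()
    movies = movies.copy()

    # 1. Fill null values (the fill values here are illustrative choices).
    ratings["rating"] = ratings["rating"].fillna(ratings["rating"].mean())
    movies["title"] = movies["title"].fillna("Unknown")

    # 2. Correct data types so the merge key matches on both sides.
    ratings["item_id"] = ratings["item_id"].astype("int64")
    movies["item_id"] = movies["item_id"].astype("int64")
    ratings["rating"] = ratings["rating"].astype("float64")

    # 3. Merge on item_id; a left join keeps every rating row.
    return ratings.merge(movies, on="item_id", how="left")


if __name__ == "__main__":
    full = preprocess(pd.read_csv("data/ratings.csv"), pd.read_csv("data/movies.csv"))
    full.to_csv("data/full_data.csv", index=False)
```

A left join is used here so that a rating without a matching movie is kept instead of silently dropped; an inner join would be the choice if only fully matched rows are wanted.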

🔟 Create Folder in Docker Container

Goal: Create a place in the container for our data.

How: Access the Docker terminal and create the /data folder. 📂


1️⃣1️⃣ Publish Data to Container

Goal: Send the data into the Docker container.

How: Run publish_data.py and watch the data transfer! 🚚
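
Assuming the data travels through the Pub/Sub topic, one message per CSV row, a publisher could look roughly like the sketch below. The JSON payload format and the project and topic names are hypothetical choices; the real publish_data.py may work differently.

```python
import csv
import json

# Hypothetical identifiers for illustration only.
PROJECT_ID = "etl-movies-local"
TOPIC_ID = "movies-topic"


def row_to_message(row: dict) -> bytes:
    """Serialize one CSV row as a UTF-8 JSON payload."""
    return json.dumps(row, sort_keys=True).encode("utf-8")


def publish_csv(csv_path: str, project_id: str, topic_id: str) -> int:
    """Publish every row of the CSV file and return the number of messages sent."""
    # Imported here so the module loads even where the client library is absent.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)

    futures = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # publish() batches in the background and returns a future per message.
            futures.append(publisher.publish(topic_path, data=row_to_message(row)))

    # Wait for all publishes so failures surface before the script exits.
    for future in futures:
        future.result(timeout=30)
    return len(futures)


if __name__ == "__main__":
    count = publish_csv("data/full_data.csv", PROJECT_ID, TOPIC_ID)
    print(f"published {count} rows")
```

Collecting the futures and waiting on them at the end lets the client batch messages, which matters once the file reaches the 100,000-row range.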


💡 Understanding Pub/Sub and Its Role

Google Cloud Pub/Sub is a messaging service for handling data in real time, and in this project it lets us simulate how large-scale data processing would work in the cloud. The Pub/Sub emulator lets us develop and test everything locally, so we’re ready for the real cloud when the time comes. ☁️


🚧 Challenges Faced

  • Handling Big Data: Processing 100,000 rows was a challenge, but we conquered it! 💪
  • Local Cloud Simulation: Setting up the Pub/Sub emulator wasn’t easy, but it was worth it. 🎓
  • Data Cleaning: Ensuring clean and reliable data required some serious attention to detail. 🧹

⭐ Result

image


🎉 Conclusion

This project demonstrates how to build a robust ETL pipeline, complete with Dockerization and cloud simulation. Whether you’re here to learn or to build, this project has all the tools and guidance you need. Happy coding! 👩‍💻👨‍💻


📂 Repository

Find everything you need in our GitHub repository. Dive in, explore, and feel free to contribute! 🎁


Contributing

Contributions to this project are welcome! If you’d like to contribute or have any questions, please contact: