End-to-End ETL Pipeline Project

Overview

This project demonstrates an end-to-end ETL (Extract, Transform, Load) pipeline where I extract raw data from Kaggle, clean and transform it using Python, load it into a SQL Server database, and perform data analysis using SQL.

Project Workflow

  1. Extract:

    • Extracted raw data from Kaggle using the Kaggle API and Python.
  2. Transform:

    • Cleaned and transformed the raw data using the pandas and numpy libraries.
    • Applied transformations such as handling missing values, normalizing numeric columns, and preparing the data for database loading (a short pandas sketch follows this list).
  3. Load:

    • Loaded the cleaned data into a SQL Server database using SQLAlchemy.
  4. Data Analysis:

    • Conducted in-depth data analysis in SQL Server to uncover trends and insights.
    • Used SQL queries to generate reports and derive actionable insights from the data.
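
To make the transform step concrete, here is a minimal pandas sketch. The file paths and the column names (order_id, price) are hypothetical placeholders, not the actual dataset's schema.

import pandas as pd

# Read the raw extract (path is an assumption based on the data/ folder)
df = pd.read_csv("data/raw_dataset.csv")

# Handle missing values: drop rows missing a key column, fill a numeric one
df = df.dropna(subset=["order_id"])                      # hypothetical key column
df["price"] = df["price"].fillna(df["price"].median())   # hypothetical numeric column

# Normalize the numeric column to the 0-1 range (min-max scaling)
df["price_scaled"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

# Standardize column names before loading into SQL Server
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

df.to_csv("data/cleaned_dataset.csv", index=False)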

Technologies Used

  • Python: For data extraction, cleaning, and transformation.

    • Libraries: pandas, numpy, SQLAlchemy, pyodbc
  • Kaggle API: To extract data from Kaggle datasets.

  • SQL Server: The database for storing and querying the cleaned data (a sample connection sketch follows this list).

  • SQL: For performing data analysis and generating insights.
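
As a rough illustration of how SQLAlchemy and pyodbc talk to SQL Server, a connection can be opened like this; the server name, database name, and ODBC driver version are placeholders for your own environment.

from sqlalchemy import create_engine, text

# Connection URL for SQL Server via pyodbc (Windows authentication assumed);
# host, database, and driver name are placeholders.
engine = create_engine(
    "mssql+pyodbc://localhost/ETL_DB"
    "?driver=ODBC+Driver+17+for+SQL+Server"
    "&trusted_connection=yes"
)

# Quick sanity check that the connection works
with engine.connect() as conn:
    print(conn.execute(text("SELECT @@VERSION")).scalar())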

Installation and Setup

Prerequisites

  • Python 3.8+
  • SQL Server
  • Kaggle API Key: Set up the Kaggle API for downloading datasets.

Clone the Repository

git clone https://github.com/your-username/ETL-Pipeline-Project.git
cd ETL-Pipeline-Project

Install Required Python Libraries

pip install -r requirements.txt
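
Based on the libraries listed above, requirements.txt is expected to include at least the following packages (pinning exact versions is up to you):

pandas
numpy
sqlalchemy
pyodbc
kaggle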

Kaggle API Setup

  1. Install the Kaggle package (it provides both the command-line tool and the Python API):
    pip install kaggle
  2. Place your Kaggle API key (kaggle.json) in the appropriate directory (a short download sketch follows these steps):
    • For Windows: C:\Users\<username>\.kaggle\kaggle.json
    • For Mac/Linux: ~/.kaggle/kaggle.json
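
With kaggle.json in place, the extraction script can call the Kaggle Python API directly, roughly like this; the dataset slug "owner/dataset-name" is a placeholder.

from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate with the kaggle.json key set up above
api = KaggleApi()
api.authenticate()

# Download and unzip the dataset into the data/ folder
# ("owner/dataset-name" is a placeholder slug; replace with the real dataset)
api.dataset_download_files("owner/dataset-name", path="data/", unzip=True)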

Running the Project

  1. Extract the data:

    • Run the extraction script to download the dataset from Kaggle:
      python scripts/extract_data.py
  2. Transform and Load the data:

    • Run the transformation and loading script to clean the data and load it into SQL Server (a rough sketch of the load step follows this list):
      python scripts/transform_load.py
  3. Perform Data Analysis:

    • SQL queries for data analysis can be found in analysis/data_analysis.sql. You can execute them in SQL Server Management Studio (SSMS).
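
For reference, here is a minimal sketch of what the load step in transform_load.py could look like, plus one analysis query run from Python instead of SSMS. The connection details, the table name sales_data, and the query itself are illustrative, not the project's actual schema.

import pandas as pd
from sqlalchemy import create_engine

# Connection details are placeholders; see the connection sketch earlier
engine = create_engine(
    "mssql+pyodbc://localhost/ETL_DB"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

# Load the cleaned data into SQL Server (table name is illustrative)
df = pd.read_csv("data/cleaned_dataset.csv")
df.to_sql("sales_data", engine, if_exists="replace", index=False)

# Example of running an analysis query from Python (column names are illustrative)
query = "SELECT category, COUNT(*) AS n FROM sales_data GROUP BY category"
print(pd.read_sql(query, engine))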

Project Structure

ETL-Pipeline-Project/
│
├── data/               # Contains raw data and cleaned data
├── scripts/            # Python scripts for ETL
│   ├── extract_data.py     # Script to extract data from Kaggle
│   ├── transform_load.py   # Script to clean and load data into SQL Server
│
├── analysis/           # Contains SQL queries for data analysis
│   └── data_analysis.sql   # SQL queries for analysis
│
├── README.md           # Project overview and instructions
└── requirements.txt    # List of Python dependencies

Key Learnings

  • Integrated data extraction from external APIs (Kaggle) into an ETL pipeline.
  • Applied data cleaning and transformation techniques using Python.
  • Gained experience loading data into SQL Server using SQLAlchemy and pyodbc.
  • Conducted efficient data analysis in SQL to uncover meaningful insights.

Next Steps

  • Visualize the data using Power BI or Tableau to create insightful dashboards.
  • Extend the analysis by integrating more datasets from Kaggle.

Contributing

Feel free to fork this repository and submit pull requests if you'd like to contribute!

Contact

If you have any questions or suggestions, feel free to reach out!

