This project demonstrates an end-to-end ETL (Extract, Transform, Load) pipeline where I extract raw data from Kaggle, clean and transform it using Python, load it into a SQL Server database, and perform data analysis using SQL.
- **Extract:**
  - Extracted data from the Kaggle API using Python.
- **Transform:**
  - Cleaned and transformed the raw data using Python libraries like `pandas` and `numpy`.
  - Applied necessary transformations such as handling missing values, normalizing data, and preparing it for database loading.
- **Load:**
  - Loaded the cleaned data into a SQL Server database using `SQLAlchemy`.
- **Data Analysis:**
  - Conducted in-depth data analysis in SQL Server to uncover trends and insights.
  - Used SQL queries to generate reports and derive actionable insights from the data.
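The transform step above (handling missing values and normalizing numeric columns) can be sketched with `pandas` and `numpy`; the column names here are hypothetical, since the actual schema depends on the dataset pulled from Kaggle:

```python
import numpy as np
import pandas as pd

def clean_and_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Handle missing values and normalize numeric columns for loading."""
    df = df.copy()

    # Fill numeric gaps with the column median, categorical gaps with a sentinel.
    num_cols = df.select_dtypes(include=np.number).columns
    other_cols = df.columns.difference(num_cols)
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    df[other_cols] = df[other_cols].fillna("unknown")

    # Min-max normalize numeric columns to [0, 1]; guard against zero ranges.
    ranges = (df[num_cols].max() - df[num_cols].min()).replace(0, 1)
    df[num_cols] = (df[num_cols] - df[num_cols].min()) / ranges
    return df

# Tiny illustrative frame (hypothetical columns):
raw = pd.DataFrame({"price": [10.0, None, 30.0], "category": ["a", None, "b"]})
clean = clean_and_transform(raw)
```

The median fill keeps outliers from skewing imputed values, and normalizing before the load keeps downstream SQL aggregates on a comparable scale.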
- **Python**: For data extraction, cleaning, and transformation.
  - Libraries: `pandas`, `numpy`, `SQLAlchemy`, `pyodbc`
- **Kaggle API**: To extract data from Kaggle datasets.
- **SQL Server**: As the database for storing and querying data.
- **SQL**: For performing data analysis and generating insights.
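For SQL Server, `SQLAlchemy` typically sits on top of `pyodbc`. One common way to build the engine URL is to URL-encode a raw ODBC connection string and pass it via `odbc_connect`; the server, database, and driver values below are placeholders that depend on your environment:

```python
from urllib.parse import quote_plus

# Raw ODBC connection string; SERVER/DATABASE/DRIVER values are placeholders.
odbc_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;"
    "DATABASE=etl_db;"
    "Trusted_Connection=yes;"
)

# SQLAlchemy's mssql+pyodbc dialect accepts the encoded string via odbc_connect.
engine_url = f"mssql+pyodbc:///?odbc_connect={quote_plus(odbc_str)}"

# With SQLAlchemy installed, the engine would then be created as:
#   from sqlalchemy import create_engine
#   engine = create_engine(engine_url)
print(engine_url)
```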
- Python 3.8+
- SQL Server
- Kaggle API Key: Set up the Kaggle API for downloading datasets.
```bash
git clone https://github.com/your-username/ETL-Pipeline-Project.git
cd ETL-Pipeline-Project
pip install -r requirements.txt
```
- Install the Kaggle CLI by running the following command:

  ```bash
  pip install kaggle
  ```

- Place your Kaggle API key (`kaggle.json`) in the appropriate directory:
  - For Windows: `C:\Users\<username>\.kaggle\kaggle.json`
  - For Mac/Linux: `~/.kaggle/kaggle.json`
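The `kaggle.json` file you download from your Kaggle account's API section follows a fixed two-field format; the values below are placeholders:

```json
{
  "username": "your-kaggle-username",
  "key": "your-api-key"
}
```

On Mac/Linux, also restrict its permissions with `chmod 600 ~/.kaggle/kaggle.json`, or the CLI will warn that the key is world-readable.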
- **Extract the data:**
  - Run the extraction script to download the dataset from Kaggle:

    ```bash
    python scripts/extract_data.py
    ```

- **Transform and Load the data:**
  - Run the transformation and loading script to clean the data and load it into SQL Server:

    ```bash
    python scripts/transform_load.py
    ```

- **Perform Data Analysis:**
  - SQL queries for data analysis can be found in the `analysis/data_analysis.sql` file. You can execute them in SQL Server Management Studio (SSMS).
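The load and analysis steps can be sketched as follows, using an in-memory SQLite database as a stand-in so the example runs anywhere; in the real pipeline the connection would instead be a SQLAlchemy engine pointed at SQL Server (`mssql+pyodbc://...`), and the table and column names here are hypothetical:

```python
import sqlite3
import pandas as pd

# Cleaned frame produced by the transform step (hypothetical columns).
clean = pd.DataFrame({"product": ["a", "b", "c"], "price": [0.0, 0.5, 1.0]})

# Stand-in connection; swap in create_engine("mssql+pyodbc:///?odbc_connect=...")
# when loading into SQL Server.
conn = sqlite3.connect(":memory:")
clean.to_sql("products", conn, if_exists="replace", index=False)

# The analysis step then runs plain SQL against the loaded table.
avg_price = pd.read_sql_query("SELECT AVG(price) AS avg_price FROM products", conn)
print(avg_price["avg_price"].iloc[0])  # 0.5
```

`if_exists="replace"` makes reruns idempotent; switch to `"append"` if the pipeline accumulates batches instead of reloading the full dataset.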
```
ETL-Pipeline-Project/
│
├── data/                    # Contains raw data and cleaned data
├── scripts/                 # Python scripts for ETL
│   ├── extract_data.py      # Script to extract data from Kaggle
│   └── transform_load.py    # Script to clean and load data into SQL Server
│
├── analysis/                # Contains SQL queries for data analysis
│   └── data_analysis.sql    # SQL queries for analysis
│
├── README.md                # Project overview and instructions
└── requirements.txt         # List of Python dependencies
```
- Integrated data extraction from external APIs (Kaggle) into an ETL pipeline.
- Applied data cleaning and transformation techniques using Python.
- Gained experience loading data into SQL Server using `SQLAlchemy` and `pyodbc`.
- Conducted efficient data analysis in SQL to uncover meaningful insights.
- Visualize the data using Power BI or Tableau to create insightful dashboards.
- Extend the analysis by integrating more datasets from Kaggle.
Feel free to fork this repository and submit pull requests if you'd like to contribute!
If you have any questions or suggestions, feel free to reach out!
- Email: ajay97kumarr@gmail.com
- LinkedIn: https://www.linkedin.com/in/ajay-kumar-029710129