This document describes the implementation of a data pipeline to predict customer returns using machine learning. The pipeline involves data loading, preprocessing, feature engineering, model training, evaluation, and generating summary reports.
- Overview
- Data Loading
- Data Quality Checks
- Feature Engineering
- Data Encoding and Scaling
- Model Training and Evaluation
- Prediction and Evaluation
- Summary Report Generation
This project involves building a machine learning model to predict whether a customer will return a product. The workflow includes loading datasets, performing data quality checks, engineering features, encoding and scaling data, training and evaluating models, and generating predictions. Finally, a summary report is generated with visualizations and statistics.
The `DataLoader` class handles loading the training and testing datasets from CSV files.

Attributes:

- `train_path` (str): Path to the training dataset CSV file.
- `test_path` (str): Path to the testing dataset CSV file.
- `train` (DataFrame): DataFrame containing the training data.
- `test` (DataFrame): DataFrame containing the testing data.
Methods:

- `__init__(self, train_path='train.csv', test_path='test.csv')`: Initializes the DataLoader, loading the train and test datasets.
- `check_data_quality(self)`: Executes all data quality checks on the datasets.
- `check_dates(self)`: Checks for date-related errors.
- `check_strings(self)`: Checks for issues in string columns.
- `check_floats(self)`: Checks for issues in float columns.
- `convert_column_types(self)`: Converts columns to their appropriate data types.
- `load_data(self)`: Returns the training and testing datasets.
This method checks for date-related errors such as incorrect formats and future dates.
This method checks for issues in string columns like missing values and duplicate IDs.
This method checks for issues in float columns like negative values and unreasonable percentages.
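The loader and its quality checks might be sketched as follows. This is a minimal illustration rather than the project's actual implementation: the column names (`Order Date`, `Customer ID`, `Discount Percentage`) and the exact check logic are assumptions, and `convert_column_types` is omitted for brevity.

```python
import pandas as pd


class DataLoader:
    """Sketch of the loader described above; column names are assumptions."""

    def __init__(self, train_path='train.csv', test_path='test.csv'):
        self.train = pd.read_csv(train_path)
        self.test = pd.read_csv(test_path)

    def check_data_quality(self):
        """Run all data quality checks in one call."""
        self.check_dates()
        self.check_strings()
        self.check_floats()

    def check_dates(self, date_col='Order Date'):
        # Coerce unparseable dates to NaT, then flag them and any future dates.
        dates = pd.to_datetime(self.train[date_col], errors='coerce')
        self.bad_dates = self.train[dates.isna() | (dates > pd.Timestamp.now())]

    def check_strings(self, id_col='Customer ID'):
        # Flag duplicate IDs (keep='first' marks every repeat after the first).
        self.duplicate_ids = self.train[self.train.duplicated(subset=[id_col])]

    def check_floats(self, pct_col='Discount Percentage'):
        # Flag negative values and percentages outside the 0-100 range.
        col = self.train[pct_col]
        self.bad_floats = self.train[(col < 0) | (col > 100)]

    def load_data(self):
        return self.train, self.test
```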
This function loads and prepares data for feature engineering.
This function adds calculated columns to the dataset, such as:
- `msrp`: Calculated MSRP based on Purchase Price and Discount Percentage.
- `RepeatReturnFlag`: Indicates if a customer has multiple returns.
- `MultiItemOrder`: Indicates if an order contains multiple items.
- `Season`: Season derived from the order date.
- `CustomerAge`: The customer's age at the time of the order.
- `Holiday`: Indicates if the order date is a US federal holiday.
- `DaysSinceFirstOrder`: Days since the customer's first order.
- `CustomerLifetimeValue`: Total purchase value of a customer.
- `OrderFrequency`: Frequency of orders by a customer.
- `ProductReturnRate`: Return rate of products in each department.
- `DayOfWeek`: Day of the week of the order date.
- `RecentReturnRate`: Rolling average of recent returns.
- `PriceSensitivity`: Sensitivity to price discounts.
- `DaysBetweenOrders`: Days between consecutive orders.
- `AvgDaysBetweenOrders`: Average days between a customer's orders.
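A few of these calculated columns can be sketched in pandas. The column names assumed here (`Order Date`, `Order ID`, `Customer ID`, `Purchase Price`, `Discount Percentage`) are guesses at the dataset's schema, and only a handful of the features above are shown:

```python
import pandas as pd


def add_features(df):
    """Illustrative versions of a few of the calculated columns."""
    df = df.copy()
    df['Order Date'] = pd.to_datetime(df['Order Date'])
    # msrp: back out the list price from the discounted purchase price.
    df['msrp'] = df['Purchase Price'] / (1 - df['Discount Percentage'] / 100)
    # MultiItemOrder: flag orders that contain more than one line item.
    df['MultiItemOrder'] = (df.groupby('Order ID')['Order ID']
                              .transform('size') > 1).astype(int)
    # Season: map the order month to a meteorological season.
    season = {12: 'Winter', 1: 'Winter', 2: 'Winter',
              3: 'Spring', 4: 'Spring', 5: 'Spring',
              6: 'Summer', 7: 'Summer', 8: 'Summer',
              9: 'Fall', 10: 'Fall', 11: 'Fall'}
    df['Season'] = df['Order Date'].dt.month.map(season)
    # DayOfWeek: 0 = Monday ... 6 = Sunday.
    df['DayOfWeek'] = df['Order Date'].dt.dayofweek
    # CustomerLifetimeValue: total purchase value per customer.
    df['CustomerLifetimeValue'] = (df.groupby('Customer ID')['Purchase Price']
                                     .transform('sum'))
    return df
```

`groupby(...).transform(...)` is used so that each aggregate is broadcast back to the row level, keeping the frame's shape unchanged.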
Encodes categorical columns using various methods such as one-hot encoding, label encoding, binary encoding, etc.
Scales continuous columns using `StandardScaler`.
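A minimal sketch of this step, assuming one-hot encoding for the categorical columns (the document also mentions label and binary encoding, which would slot in per column):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def encode_and_scale(df, categorical_cols, continuous_cols):
    """One-hot encode categoricals and standardize continuous columns."""
    # One-hot encoding: each category value becomes its own indicator column.
    df = pd.get_dummies(df, columns=categorical_cols)
    # StandardScaler: transform each continuous column to zero mean, unit variance.
    scaler = StandardScaler()
    df[continuous_cols] = scaler.fit_transform(df[continuous_cols])
    return df, scaler
```

The fitted scaler is returned so the test set can be transformed with the same parameters via `scaler.transform(...)` rather than being re-fit.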
Trains and evaluates multiple models using stratified k-fold cross-validation. The models include:
- Random Forest
- Gradient Boosting
- Extra Trees
- XGBoost
- LightGBM
Tunes the best model using `RandomizedSearchCV` and saves the trained model.
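The compare-then-tune loop might look like this sketch. Only two of the five models are shown (Extra Trees, XGBoost, and LightGBM plug into the same dictionary), and the ROC AUC metric and the tiny parameter grid are illustrative assumptions, not the project's actual settings:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score)


def train_and_select(X, y, n_splits=5, random_state=42):
    """Compare models with stratified k-fold CV, then tune the best one."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                         random_state=random_state)
    models = {
        'random_forest': RandomForestClassifier(random_state=random_state),
        'gradient_boosting': GradientBoostingClassifier(
            random_state=random_state),
    }
    # Mean cross-validated score per model, on the same folds for fairness.
    scores = {name: cross_val_score(m, X, y, cv=cv, scoring='roc_auc').mean()
              for name, m in models.items()}
    best_name = max(scores, key=scores.get)
    # Randomized search over a small, illustrative parameter grid.
    search = RandomizedSearchCV(models[best_name],
                                {'n_estimators': [50, 100, 200]},
                                n_iter=3, cv=cv, random_state=random_state)
    search.fit(X, y)
    return search.best_estimator_, scores
```

The tuned estimator would then typically be persisted, e.g. with `joblib.dump`.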
Saves model performance metrics to a CSV file.
Loads a trained model from a specified path.
- Merges historical datasets for repeat returns and product return rates.
- Adds calculated columns to the test dataset.
- Encodes and scales the test data.
- Generates predictions using the loaded model.
- Saves predictions to a CSV file.
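The final load-predict-save step of the list above can be sketched as follows; `joblib` for model persistence and the file names are assumptions (the document does not specify a serialization format):

```python
import joblib
import pandas as pd


def generate_predictions(model_path, test_df, out_path='predictions.csv'):
    """Load a persisted model, predict on prepared test data, save to CSV."""
    model = joblib.load(model_path)          # deserialize the trained model
    preds = model.predict(test_df)           # test_df must already be encoded/scaled
    pd.DataFrame({'Prediction': preds}).to_csv(out_path, index=False)
    return preds
```

Note that `test_df` is assumed to have passed through the same feature engineering, encoding, and scaling as the training data before this call.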
- Plots feature importances of the trained model using `matplotlib` and `seaborn`.
- Generates summary statistics for returned and non-returned customers.
- Saves summary statistics to CSV files.
- Plots histograms and bar charts for continuous and categorical features, respectively.
- Saves the plots as PNG files.
- Generates a PDF report with summary statistics and visualizations using `FPDF`.
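The per-feature plotting step above might look like this sketch (the PDF assembly with `FPDF` is omitted; the dtype-based dispatch between histogram and bar chart is an assumption):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature(df, column, out_path):
    """Histogram for continuous columns, bar chart for categorical ones."""
    fig, ax = plt.subplots()
    if pd.api.types.is_numeric_dtype(df[column]):
        df[column].plot.hist(ax=ax)               # continuous -> histogram
    else:
        df[column].value_counts().plot.bar(ax=ax)  # categorical -> bar chart
    ax.set_title(column)
    fig.savefig(out_path)   # persisted as PNG for embedding in the report
    plt.close(fig)          # free the figure to avoid leaking memory in a loop
```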
Ensure that the paths in your configuration or credential files are correctly set according to your execution environment. This step is crucial for the correct functioning of the scripts.
Navigate to the folder that contains your credentials and other configuration files.
Before running any scripts, install the required Python packages. Ensure a `requirements.txt` file is present in the current directory, then run:

```shell
pip install -r requirements.txt
```
Run the Python script named `trigger.py` from the command terminal:

```shell
python trigger.py
```