This project aims to compare the performance of various classification algorithms to predict whether a client will subscribe to a long-term deposit based on data from multiple marketing campaigns conducted by a Portuguese banking institution. The classifiers compared in this project include K-Nearest Neighbors (KNN), Logistic Regression, Decision Trees, and Support Vector Machines (SVM).
- Business Understanding
- Data Understanding
- Data Preparation
- Model Training
- Evaluation
- Deployment
- Usage
- Repository Structure
- Notebook
The primary objective of this project is to optimize the direct marketing efforts of the bank by accurately predicting whether a client will subscribe to a long-term deposit. By achieving this, the bank can improve the efficiency of its marketing campaigns, reduce costs, and enhance customer targeting, leading to higher profitability and better resource allocation.
- Source: UCI Machine Learning Repository (Bank Marketing dataset)
- Content: The dataset 'bank-additional-full.csv' comprises data collected from 17 marketing campaigns conducted between May 2008 and November 2010, resulting in 41,188 records and 20 inputs.
- Attributes: The dataset includes various features related to client demographics, previous campaign interactions, and economic indicators.
- Client Data:
  - `age`: Age of the client
  - `job`: Job type (categorical)
  - `marital`: Marital status (categorical)
  - `education`: Education level (categorical)
  - `default`: Has credit in default? (categorical)
  - `housing`: Has housing loan? (categorical)
  - `loan`: Has personal loan? (categorical)
- Contact Data:
  - `contact`: Communication type (categorical)
  - `month`: Last contact month (categorical)
  - `day_of_week`: Last contact day of the week (categorical)
  - `duration`: Last contact duration in seconds (numeric)
- Campaign Data:
  - `campaign`: Number of contacts performed during this campaign (numeric)
  - `pdays`: Number of days since the client was last contacted (numeric)
  - `previous`: Number of contacts before this campaign (numeric)
  - `poutcome`: Outcome of the previous campaign (categorical)
- Economic Indicators:
  - `emp.var.rate`: Employment variation rate (numeric)
  - `cons.price.idx`: Consumer price index (numeric)
  - `cons.conf.idx`: Consumer confidence index (numeric)
  - `euribor3m`: Euribor 3 month rate (numeric)
  - `nr.employed`: Number of employees (numeric)
- Target Variable:
  - `y`: Subscription to a term deposit (binary: 'yes' or 'no')
- `unknown` values in categorical variables such as `job`, `education`, `default`, `housing`, and `loan` represent missing data.
- `999` in `pdays` indicates that the client was not previously contacted.
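The two conventions above are easy to overlook because neither shows up as a null value. The sketch below counts `unknown` placeholders and flags the `999` sentinel on a tiny, made-up sample that borrows a few of the dataset's column names. The sample rows and the `previously_contacted` column are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical miniature sample that mimics the dataset's conventions.
sample = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "services"],
    "education": ["unknown", "university.degree", "high.school", "basic.9y"],
    "pdays": [999, 3, 999, 6],
})

# Count 'unknown' placeholders per categorical column.
categorical_cols = ["job", "education"]
unknown_counts = (sample[categorical_cols] == "unknown").sum()

# Flag clients who were contacted before (pdays != 999).
# This derived column is one possible treatment, not the notebook's method.
sample["previously_contacted"] = sample["pdays"] != 999

print(unknown_counts)
print(sample["previously_contacted"].tolist())
```

Keeping the sentinel in a separate boolean column is one way to stop `999` from being read as a real number of days by distance-based models.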
- Handling Missing Values:
  - No null values; missing information is encoded as the `unknown` category in the categorical features.
- Data Cleaning:
  - Removed 12 duplicate rows from the dataset.
- Correlation Analysis:
  - A correlation matrix was calculated for the numerical features to understand the relationships between variables. Some key insights include:
    - `pdays` and `previous` show a strong negative correlation (-0.59).
    - `emp.var.rate`, `euribor3m`, and `nr.employed` are strongly positively correlated with each other.
    - `duration` has a minimal correlation with other features but is crucial in predicting the target variable.
- Splitting Data:
  - The dataset was split into training and test sets to evaluate model performance.
- Feature Engineering: Preprocessing Pipeline
  - For Logistic Regression, KNN, and SVM:
    - Numerical features were scaled using `StandardScaler`.
    - Categorical features were encoded using `OrdinalEncoder`.
  - For Decision Trees:
    - No scaling was required as decision trees are insensitive to feature scaling.
    - Categorical features were encoded using `OrdinalEncoder`.
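The preparation steps above (dropping duplicates, splitting, then scaling and encoding) could be wired together roughly as follows. This is a minimal sketch on a made-up eight-row frame. The split size, `random_state`, stratification, and the `handle_unknown` setting are assumptions, since the README does not state them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical miniature frame using a few of the dataset's column names.
df = pd.DataFrame({
    "age": [30, 45, 30, 52, 41, 36, 58, 29],
    "duration": [120, 300, 120, 80, 410, 95, 610, 150],
    "job": ["admin.", "services", "admin.", "retired",
            "technician", "admin.", "retired", "services"],
    "contact": ["cellular", "telephone", "cellular", "cellular",
                "telephone", "cellular", "cellular", "telephone"],
    "y": ["no", "yes", "no", "no", "yes", "no", "yes", "no"],
})

# The third row repeats the first, so exactly one row is dropped here.
df = df.drop_duplicates()

X = df.drop(columns="y")
y = (df["y"] == "yes").astype(int)

numeric_cols = ["age", "duration"]
categorical_cols = ["job", "contact"]

# test_size, random_state and stratify are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, stratify=y, random_state=42)

def make_encoder():
    # Categories unseen during fit are mapped to -1 instead of raising.
    return OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Scaled variant for Logistic Regression, KNN and SVM.
scaled_prep = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", make_encoder(), categorical_cols),
])

# Unscaled variant for Decision Trees.
tree_prep = ColumnTransformer([
    ("num", "passthrough", numeric_cols),
    ("cat", make_encoder(), categorical_cols),
])

# Fit on the training split only, then reuse the fitted transform on test data.
X_train_scaled = scaled_prep.fit_transform(X_train)
X_test_scaled = scaled_prep.transform(X_test)
X_train_tree = tree_prep.fit_transform(X_train)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the transformers on the training split alone keeps test-set statistics out of the scaler and encoder.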
- K-Nearest Neighbors (KNN)
- Logistic Regression
- Decision Trees
- Support Vector Machines (SVM)
Several machine learning models were trained, and their performance was compared:
Model | Train Time (s) | Train Accuracy | Test Accuracy | Precision | Recall | F1-Score | AUC |
---|---|---|---|---|---|---|---|
Logistic Regression | 0.24 | 0.91 | 0.91 | 0.66 | 0.42 | 0.51 | 0.93 |
Decision Tree | 0.20 | 1.00 | 0.89 | 0.51 | 0.53 | 0.52 | 0.73 |
KNN | 0.00248 | 0.93 | 0.90 | 0.59 | 0.40 | 0.47 | 0.86 |
SVM | 188.65 | 0.91 | 0.91 | 0.68 | 0.34 | 0.45 | 0.93 |
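- Decision Tree achieved the highest F1 score (0.52), with precision (0.51) and recall (0.53) closely balanced, but it also had the lowest AUC (0.73). Its perfect train accuracy (1.00) against a test accuracy of 0.89 points to overfitting.
- Logistic Regression performed comparably on F1 (0.51) and reached a much better AUC (0.93).
- KNN was the fastest to train but trailed Logistic Regression on both F1 (0.47) and AUC (0.86).
- SVM was by far the most computationally expensive model (train time of 188.65, versus under 1 for the others) and had the lowest F1 score (0.45).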
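A comparison like the one in the table can be produced with a simple loop that times `fit` and then scores each model. The sketch below runs on synthetic, imbalanced data from `make_classification` rather than the bank dataset, and assumes default hyperparameters for the baselines, so its numbers will not match the table.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the preprocessed data (assumption).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.88, 0.12], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Default hyperparameters are assumed for the baseline run.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

def positive_scores(model, X):
    """Continuous scores for the positive class, used for ROC-AUC."""
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return model.predict_proba(X)[:, 1]

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    y_pred = model.predict(X_test)
    results[name] = {
        "train_time_s": train_time,
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "auc": roc_auc_score(y_test, positive_scores(model, X_test)),
    }

for name, row in results.items():
    print(name, {k: round(v, 4) for k, v in row.items()})
```

An unconstrained decision tree memorizes the training split here as well, which mirrors the 1.00 train accuracy reported for the baseline Decision Tree.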
Hyperparameter tuning was performed using cross-validation and Grid Search:
- Logistic Regression: Best parameters found were `C=0.1`, `penalty='l1'`, and `solver='saga'`.
- Decision Tree: Best parameters were `criterion='gini'`, `max_depth=5`, `min_samples_leaf=1`, and `min_samples_split=2`.
- KNN: Optimal parameters were `n_neighbors=21`, `p=2`, and `weights='distance'`.
- SVM: The RBF kernel was found to be the best, but the model was too computationally expensive for this dataset and was not evaluated further due to compute constraints.
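A grid search of this kind might look like the sketch below. The candidate grids include the best values reported above, but the remaining candidates, the `f1` scoring choice, the 5-fold setting, and the synthetic data are all assumptions made for illustration. The grids actually searched are in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.85, 0.15], random_state=0)

# Hypothetical grids built around the reported best parameters.
searches = {
    "Logistic Regression": (
        LogisticRegression(solver="saga", max_iter=5000),
        {"C": [0.01, 0.1, 1.0], "penalty": ["l1", "l2"]},
    ),
    "Decision Tree": (
        DecisionTreeClassifier(random_state=0),
        {"criterion": ["gini", "entropy"], "max_depth": [3, 5, 10],
         "min_samples_leaf": [1, 5], "min_samples_split": [2, 10]},
    ),
    "KNN": (
        KNeighborsClassifier(),
        {"n_neighbors": [5, 11, 21], "p": [1, 2],
         "weights": ["uniform", "distance"]},
    ),
}

best_params = {}
for name, (estimator, grid) in searches.items():
    # Every parameter combination is scored with 5-fold cross-validation.
    search = GridSearchCV(estimator, grid, scoring="f1", cv=5)
    search.fit(X, y)
    best_params[name] = search.best_params_
    print(name, search.best_params_, round(search.best_score_, 4))
```

Scoring on F1 rather than accuracy is a deliberate choice in this sketch, since accuracy is dominated by the majority 'no' class.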
- Accuracy: The proportion of correctly classified instances.
- Precision, Recall, F1-Score: To assess the balance between false positives and false negatives.
- ROC-AUC: To evaluate the model's ability to distinguish between the two classes.
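Since F1 is the harmonic mean of precision and recall, the reported values can be sanity-checked by hand. The short sketch below recomputes F1 from the precision and recall figures this README reports for the tuned models. The helper name `f1_from` is illustrative, and small differences in the last digit come from the inputs already being rounded.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as listed for the tuned models.
reported = {
    "Logistic Regression": (0.6576, 0.4159, 0.5096),
    "Decision Trees": (0.6503, 0.5151, 0.5749),
    "KNN": (0.6330, 0.3179, 0.4232),
}

for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: computed {f1_from(p, r):.4f}, reported {f1_reported:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model with high precision but low recall, such as KNN here, ends up with a low F1.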
The following table summarizes the performance metrics for the improved models:
Model | Train Time (s) | Train Accuracy | Test Accuracy | Precision | Recall | F1 Score | AUC |
---|---|---|---|---|---|---|---|
Logistic Regression | 39.04 | 0.9107 | 0.9098 | 0.6576 | 0.4159 | 0.5096 | 0.9333 |
Decision Trees | 52.84 | 0.9175 | 0.9142 | 0.6503 | 0.5151 | 0.5749 | 0.9303 |
KNN | 541.80 | 1.0000 | 0.9024 | 0.6330 | 0.3179 | 0.4232 | 0.9129 |
- Logistic Regression:
  - Achieved a balanced performance with a strong AUC of 0.9333, indicating good discrimination between classes.
  - Precision (0.6576) is moderate, and recall (0.4159) leaves room for improvement in capturing more true positives.
- Decision Trees:
  - The Decision Tree model delivered the best F1 score of 0.5749, balancing precision and recall more effectively than the other models.
  - It also had the highest test accuracy (0.9142), with an AUC of 0.9303 that sits just below Logistic Regression's 0.9333.
- K-Nearest Neighbors (KNN):
  - KNN reached a perfect train accuracy of 1.0000 but a lower test accuracy of 0.9024. With `weights='distance'`, each training point is its own zero-distance neighbor, so the perfect train score is largely expected and says little about generalization.
  - KNN has the lowest recall at 0.3179, indicating it missed many true positives, but it maintained a decent AUC of 0.9129.
The Decision Tree model stands out as the best performer for this classification task, especially in terms of the F1 score and test accuracy. This model offers a good balance between precision and recall, making it a strong candidate for deployment. Logistic Regression also performed well and could be considered depending on the specific use case requirements. However, KNN, despite its high training accuracy, may require further tuning or regularization to improve generalization and recall.
These findings suggest that feature engineering, further hyperparameter tuning, and model selection will be crucial in optimizing performance for this dataset.
The decision tree model provides insight into different customer segments, guiding targeted marketing strategies:
- Characteristics: Customers in regions with low employment rates and short call durations.
- Behavior: Less likely to subscribe unless call duration is slightly longer.
- Recommendations: Engage customers by extending conversations and focusing on understanding their needs.
- Characteristics: Customers in low employment regions with longer call durations.
- Behavior: More likely to subscribe, especially if there have been few prior contacts.
- Recommendations: Use the extended call time to build a strong connection and present term deposits as a secure investment option.
- Characteristics: Customers in high employment regions but with short call durations.
- Behavior: Less likely to subscribe, especially with low economic confidence.
- Recommendations: Address concerns early in the conversation and suggest follow-up emails or digital resources.
- Characteristics: Customers in regions with high employment, long call durations, and favorable economic conditions.
- Behavior: More likely to subscribe, particularly if economic indicators are positive.
- Recommendations: Emphasize the benefits of term deposits and introduce related financial products.
- Characteristics: Customers in regions with high employment, long call durations, but facing less favorable economic indicators.
- Behavior: Decision to subscribe is nuanced and influenced by economic conditions.
- Recommendations: Provide products with flexibility and stress the security of the investment.
- Tailored Campaigns: Develop targeted marketing campaigns based on customer segments, focusing on their specific needs and preferences.
- Call Duration Optimization: Implement strategies to increase call duration for segments with lower conversion rates.
- Economic Indicators: Continuously monitor economic indicators and adjust marketing strategies accordingly.
- Agent Training: Equip sales agents with knowledge about different customer segments and how to tailor their approach.
- Customer Relationship Management (CRM): Utilize CRM systems to track customer interactions, preferences, and purchase history for personalized marketing.
- The final model is ready for deployment, allowing the bank to predict the likelihood of clients subscribing to long-term deposits. The model can be integrated into the bank's CRM system for real-time decision-making.
- Python 3.x
- pandas, numpy, scikit-learn, matplotlib, seaborn (Python libraries)
Clone the Repository:
git clone https://github.com/mitbans/Bank-Marketing-Campaigns-Analysis.git
- Deploy the model using a web application framework like Flask or Streamlit for user-friendly interaction.
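Before wrapping the model in Flask or Streamlit, it has to be saved and reloaded outside the notebook. The sketch below shows one common way to do that with `joblib`, using a depth-5 decision tree trained on synthetic data. The file name `deposit_model.joblib` and the helper `subscription_probability` are hypothetical, and the repository may persist its model differently.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Save the fitted model to disk and load it back, as a web app would at startup.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "deposit_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

def subscription_probability(clf, rows):
    """Probability of the positive ('yes') class for each row."""
    return clf.predict_proba(rows)[:, 1]

probs = subscription_probability(loaded, X[:5])
print(probs.round(3))
```

Returning a probability rather than a hard label lets the bank rank clients and choose its own contact threshold.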
- `data/bank-additional-full.csv`: Contains the dataset used in the analysis.
- `images/decision_tree.pdf`: Contains the final Decision Tree model.
- `notebooks/Predicting-Long-Term-Deposit-Success.ipynb`: Jupyter notebook with code for data analysis.
- `README.md`: Summary of findings and link to the notebook.
The detailed analysis and code can be found in the Jupyter notebook at `notebooks/Predicting-Long-Term-Deposit-Success.ipynb`.