This project is focused on creating a machine learning model to predict customer conversion rates for an auto insurance company. The project includes data analysis, model building, and the creation of a BI dashboard to visualize the results.
Objectives:
- Identify quoted policies that the company will convert (a.k.a. issue)
- Understand key characteristics of policies the company tends to write, as well as those they tend not to write (e.g. understand quoted policies with both high and low conversion rates)
- Provide a recommendation on how this information could be leveraged at the company
Travelers provided the data for this project, which includes customers' historical and demographic information.
The data was cleaned and explored to identify any trends or patterns that could be used to predict conversion rates.
In our exploratory data analysis (EDA), we used several techniques to gain insights, identify patterns, and test hypotheses:
- Visualization: We used common chart types such as line graphs, bar charts, and histograms. They are simple but effective for viewing and understanding the data.
- Summary statistics: Calculating summary statistics such as the mean, median, mode, and standard deviation helps us get a sense of the distribution of numeric data. We identified outliers in the Age variable and skewed distributions in other variables, such as safety rating and quoted amount.
- Data cleaning and preparation: We first cleaned the data to correct data type issues and bring it into a usable state. There are various techniques for dealing with missing values and outliers, and we applied different methods as needed throughout the project. For instance, we used the Box-Cox transformation to normalize skewed variables.
- Correlation analysis: Identifying correlations between variables helps show how they are related and flags relationships worth investigating further, although correlation alone does not establish causation. In our analysis, we calculated bivariate correlations using the Pearson correlation coefficient and visualized them with a heatmap.
- Statistical testing: We conducted an A/B test to compare the conversion rate of customers who were given a discount with that of customers who were not. The analysis showed that the discount significantly improved conversion, but only before 2017. For details, see the Jupyter notebook.
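
The EDA steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's notebook code: the column names (`quoted_amt`, `age`, `discount`, `converted`) are hypothetical, and the chi-squared test of independence is only one reasonable way to compare two conversion rates, since the exact test used is not stated here.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the quote data; all column names are hypothetical.
quotes = pd.DataFrame({
    "quoted_amt": rng.lognormal(mean=8.0, sigma=0.8, size=n),  # right-skewed, positive
    "age": rng.integers(18, 80, size=n),
    "discount": rng.integers(0, 2, size=n),
})
# Make conversion slightly more likely when a discount was offered.
conv_prob = 0.10 + 0.05 * quotes["discount"]
quotes["converted"] = (rng.random(n) < conv_prob).astype(int)

# Summary statistics and skewness of a numeric column.
summary = quotes[["quoted_amt", "age"]].describe()
skew_before = stats.skew(quotes["quoted_amt"])

# Box-Cox transformation (requires strictly positive values).
transformed, lam = stats.boxcox(quotes["quoted_amt"].to_numpy())
skew_after = stats.skew(transformed)

# Pairwise Pearson correlations; this matrix is what a heatmap would display.
corr = quotes.corr(method="pearson")

# Compare conversion rates with and without a discount.
table = pd.crosstab(quotes["discount"], quotes["converted"])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

print(summary)
print(f"skew before: {skew_before:.2f}, after Box-Cox: {skew_after:.2f}")
print(f"chi-squared p-value for discount vs. conversion: {p_value:.4f}")
```

The correlation matrix `corr` can be passed to any heatmap function (for example, seaborn's `heatmap`) to reproduce the kind of plot described above.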
Several machine learning algorithms were tested and evaluated to determine the best model for the task, including LightGBM, XGBoost, and neural networks.
The final model chosen for this project was a LightGBM model. It was trained on 70% of the data and tested on the remaining 30% to evaluate its performance. We used cross-validation on the training set so that the results depend less on any single random split.
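
A minimal sketch of this evaluation setup is shown below. It runs on synthetic data, and several details are assumptions rather than the project's actual choices: the stratified split, the five folds, the ROC AUC metric, and the default hyperparameters. If LightGBM is not installed, the sketch falls back to scikit-learn's `GradientBoostingClassifier` purely so that it still runs.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

try:
    from lightgbm import LGBMClassifier as Model
except ImportError:
    # Stand-in only, so the sketch runs without LightGBM installed.
    from sklearn.ensemble import GradientBoostingClassifier as Model

# Synthetic, imbalanced binary data standing in for "converted / not converted".
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# 70/30 split; stratification (an assumption) keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0,
)

# 5-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(Model(random_state=0), X_train, y_train, cv=cv, scoring="roc_auc")

# Fit on the full training set, then score once on the held-out 30%.
model = Model(random_state=0).fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
print(f"Test AUC: {test_auc:.3f}")
```

Keeping the test set out of cross-validation means the final score reflects data the model never saw during model selection.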
We tried to improve the performance of our predictive model by adding new features through feature engineering. We did not know how the benchmark model had been trained, but we suspected that adding features from all of the datasets might increase its accuracy. The techniques we applied included combining multiple datasets, creating calculated fields, and aggregating data.
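
The three techniques can be illustrated with a small pandas sketch. The tables and column names below (`policies`, `drivers`, `policy_id`, and so on) are hypothetical and chosen only for illustration; they are not the project's actual schema.

```python
import pandas as pd

# Hypothetical policy-level table: one row per quoted policy.
policies = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "quoted_amt": [1200.0, 800.0, 1500.0],
    "quote_date": pd.to_datetime(["2016-03-01", "2017-07-15", "2018-01-20"]),
})

# Hypothetical driver-level table: several rows per policy.
drivers = pd.DataFrame({
    "policy_id": [1, 1, 2, 3, 3, 3],
    "age": [45, 19, 33, 52, 50, 17],
    "safety_rating": [80, 55, 70, 90, 85, 40],
})

# Data aggregation: collapse driver rows to one row per policy.
driver_agg = drivers.groupby("policy_id").agg(
    n_drivers=("age", "size"),
    min_age=("age", "min"),
    mean_safety=("safety_rating", "mean"),
).reset_index()

# Combining datasets: attach the aggregated driver features to each policy.
features = policies.merge(driver_agg, on="policy_id", how="left")

# Calculated fields derived from existing columns.
features["amt_per_driver"] = features["quoted_amt"] / features["n_drivers"]
features["quote_year"] = features["quote_date"].dt.year

print(features)
```

A left join keeps every quoted policy even if no matching driver rows exist, which avoids silently dropping quotes from the training data.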
We implemented a recursive feature selection process to identify the most important variables for our predictive model. To start, we used data visualization techniques to explore the relationships between the generated variables and conversion. This helped us to get a sense of which variables might be the most useful for our model.
Next, we applied statistical methods, including a variance threshold, the chi-squared test, and recursive feature selection, to eliminate features that were not important to the model. This allowed us to narrow down our list of candidate variables and focus on the ones with the most impact on model performance.
Finally, we used a feature importance plot to further refine our selection. LightGBM generated this plot after training, and it showed the relative importance of each variable in the model, which we used to decide which variables to include in the final model. Overall, this feature selection process helped us identify the variables that contributed the most to our model's performance and eliminate those that were less important.
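
The statistical filtering and importance steps might look roughly like the sketch below, which uses scikit-learn's `VarianceThreshold`, `SelectKBest` with `chi2`, and `RFE` on synthetic data. The thresholds, the number of features kept at each step, and the feature names are all assumptions for illustration. Note that `chi2` expects non-negative inputs, so the features are min-max scaled for that step only. As before, `GradientBoostingClassifier` is a stand-in if LightGBM is not installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2
from sklearn.preprocessing import MinMaxScaler

try:
    from lightgbm import LGBMClassifier as Model
except ImportError:
    # Stand-in only, so the sketch runs without LightGBM installed.
    from sklearn.ensemble import GradientBoostingClassifier as Model

# Synthetic data plus one constant column that carries no information.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4, n_redundant=2, random_state=0,
)
X = np.hstack([X, np.ones((X.shape[0], 1))])
names = np.array([f"f{i}" for i in range(X.shape[1])])  # hypothetical feature names

# Step 1: drop zero-variance features.
vt = VarianceThreshold(threshold=0.0).fit(X)
X1, names1 = X[:, vt.get_support()], names[vt.get_support()]

# Step 2: keep the 8 features with the highest chi-squared scores.
kbest = SelectKBest(chi2, k=8).fit(MinMaxScaler().fit_transform(X1), y)
X2, names2 = X1[:, kbest.get_support()], names1[kbest.get_support()]

# Step 3: recursively eliminate features down to 5 using the model itself.
rfe = RFE(Model(random_state=0), n_features_to_select=5).fit(X2, y)
X3, names3 = X2[:, rfe.get_support()], names2[rfe.get_support()]

# Step 4: rank the surviving features by the trained model's importances.
final_model = Model(random_state=0).fit(X3, y)
ranking = sorted(
    zip(names3, final_model.feature_importances_),
    key=lambda pair: pair[1], reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance}")
```

The `ranking` list holds the same information that a feature importance bar chart would show, ordered from most to least important.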
A BI dashboard was created and deployed on Streamlit Cloud to visualize the results of the model and allow for easy interpretation by stakeholders. The dashboard covers four topics:
- The time series analysis section includes visualizations and analysis related to how conversion metrics have changed over time.
- The customer segmentation analysis section includes five customer groups and analysis related to how customers are segmented and how different segments behave.
- The marketing and sales analysis section includes visualizations and analysis related to the general marketing and sales efforts and their effectiveness.
- The model prediction section utilizes a predictive model to forecast customer conversions based on user-input data. The model's predictions are accompanied by recommendations for improving conversion rates and maximizing revenue through targeted marketing and sales efforts. This analysis is valuable in identifying opportunities for optimization and growth.
Overall, this project was successful in creating a machine learning model that can accurately predict customer conversion rates for an auto insurance company. The BI dashboard provides a comprehensive tool for stakeholders to understand and interpret the results of the model. It was designed to provide a holistic view of the business and help companies make informed decisions based on historical data.
There are several areas for future improvement, including further feature engineering and model fine-tuning.