Kaggle_Customer_Transation_Prediction

My final submission for the 'Santander Customer Transaction Prediction' competition. I participated in this very tough and interesting Kaggle competition a while ago, and I finally found the time to put all the work together in this repo.


Kaggle Competition: 'Santander Customer Transaction Prediction'


My Final Submission for the 'Santander Customer Transaction Prediction':

In this repo, I assemble some of the work I did during an interesting (and very tough) Kaggle competition.

Here is the official link to the competition on Kaggle.

It was a true learning experience for me to participate in this challenge. What made it a special competition was the number of talented and smart participants from all over the world. I would particularly mention the Kaggle Masters and Grandmasters who drove the challenge to higher levels and provided ideas and hints all along the way.

Here is a part of the description provided by the competition hosts:

"... In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem. "

1-Exploratory Data Analysis Notebook

In this notebook, I tried to go through the data and see if I could notice a certain pattern or an interesting trend. It was one of the most interesting phases of this competition because:

  • The data was 'clean': no heavy work was required to put the variables into shape.
  • The data was synthetic: the data set was not real-world production data, but was generated by an algorithm to simulate customer behavior and to be as close as possible to the actual Santander customer data.
  • Almost every participant was stuck at a certain performance threshold: it was very hard to improve the model beyond a certain performance point.
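A first EDA pass along these lines can be sketched as a per-feature summary table. This is a minimal sketch, not the notebook's actual code: the `eda_summary` helper is hypothetical, and the demo uses random data shaped like the competition set (which had anonymized numeric features named `var_0`, `var_1`, ... and a binary `target`).

```python
import numpy as np
import pandas as pd

def eda_summary(df: pd.DataFrame, target_col: str = "target") -> pd.DataFrame:
    """Per-feature summary: basic statistics plus unique-value counts."""
    features = [c for c in df.columns if c != target_col]
    summary = pd.DataFrame({
        "mean": df[features].mean(),
        "std": df[features].std(),
        "skew": df[features].skew(),
        "n_unique": df[features].nunique(),
    })
    # Features with few unique values often deserve a closer look.
    return summary.sort_values("n_unique")

# Demo on random data with the same shape conventions as the competition set.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 5)),
                    columns=[f"var_{i}" for i in range(5)])
demo["target"] = rng.integers(0, 2, size=1000)
print(eda_summary(demo))
```

On the real data, scanning distributions and unique-value counts per feature is one quick way to confirm the "clean but synthetic" feel described above.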

2-LightGBM model with Data Augmentation

I experimented with various models and techniques, but the model with the highest performance was LightGBM. Stacking and blending were also a huge part of the top 1% winning solutions. However, I tried to keep it simple and to go through it step by step, understanding how the data behaves after passing through each different model.

One of the 'magic' ideas discussed in the competition forum was feature engineering, and especially data augmentation. Other feature engineering ideas were also applied, such as creating hundreds of new variables as blends of existing variables, trying all possible and imaginable combinations.
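The augmentation idea popular in the forum can be sketched as follows: create synthetic rows of a given class by shuffling each feature column independently within that class, which preserves per-feature marginal distributions but breaks inter-feature correlations (acceptable here since the features appeared independent). This is a minimal sketch; the `augment_class` helper is hypothetical, not the repo's exact implementation.

```python
import numpy as np

def augment_class(X: np.ndarray, n_copies: int, seed: int = 0) -> np.ndarray:
    """Build n_copies synthetic blocks of rows from a single-class subset X
    by shuffling each feature column independently within the class."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        Xc = X.copy()
        for j in range(Xc.shape[1]):
            rng.shuffle(Xc[:, j])  # in-place shuffle of one column
        copies.append(Xc)
    return np.vstack(copies)

# Example: oversample a small positive-class subset 2x.
X_pos = np.arange(12, dtype=float).reshape(4, 3)
aug = augment_class(X_pos, n_copies=2)
print(aug.shape)  # (8, 3)
```

Each augmented column is a permutation of the original column, so class-conditional feature distributions are unchanged while the number of minority-class rows grows.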

3-Other ideas that did not work (Work in progress)

Here, I will try to assemble all (or most) of the ideas that I tried but that did not work.

These were mostly different models (XGBoost, regressions, basic neural network models, etc.).

Info:

I could not share the competition data due to the competition rules. The competition host requires explicit acceptance of the rules before granting access to the data set. To get the competition data, you need a Kaggle account: access the competition page and agree to the competition rules.