This repository contains my solution for the Instacart Market Basket Analysis competition hosted on Kaggle, which earned me 39th place out of 2669 teams. For anyone interested, please check this page for details about the competition.
The solution is tailored to the Instacart dataset but can easily be applied to other e-commerce datasets by modifying the input.
The problem is formulated as approximating P(u, p | user's prior purchase history), i.e., how likely user u is to repurchase product p given that prior purchase history.
The main model is a binary classifier; features crafted manually or automatically are fed to the classifier to generate predictions.
I constructed both manual features and automatic features, the latter using unsupervised learning and neural networks (a minimal sketch of the training-set construction is given below).
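To make the formulation concrete, here is a minimal sketch of how such a binary training set can be built with pandas. The DataFrame and column names (`priors`, `train_orders`, `user_id`, `product_id`) are assumptions for illustration, not the repo's actual identifiers:

```python
import pandas as pd

def make_train_set(priors: pd.DataFrame, train_orders: pd.DataFrame) -> pd.DataFrame:
    # candidate pairs: every product a user has bought in the prior history
    candidates = priors[['user_id', 'product_id']].drop_duplicates()
    # positives: products the user actually bought again in the final order
    positives = train_orders[['user_id', 'product_id']].assign(label=1)
    train_set = candidates.merge(positives, on=['user_id', 'product_id'], how='left')
    train_set['label'] = train_set['label'].fillna(0).astype(int)
    return train_set
```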
Manual features include statistics of the prior purchase history. For the automatic features, I used the following:
LDA
- each user is treated as a document and each product as a word
- generate topic representations of each user and each product
- calculate a <u,p> score as the inner product of the user and product topic vectors
- apply similar operations at the aisle and department level
- both the <u,p> score and the compressed topic representations of users/items serve as good features (see the sketch below)
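A hedged sketch of these LDA features using gensim, treating purchase histories as documents as described above; the function and variable names (`lda_features`, `user_products`) are illustrative:

```python
from gensim import corpora, models

# user_products: dict mapping user_id -> list of product-id strings,
# e.g. {1: ['196', '12427', '196'], 2: ['32792']}
def lda_features(user_products, num_topics=32):
    docs = list(user_products.values())
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)

    # dense topic vector for each user (the compressed representation)
    user_topics = {
        uid: dict(lda.get_document_topics(bow, minimum_probability=0.0))
        for uid, bow in zip(user_products, corpus)
    }

    # topic vector for a product (word)
    def product_topics(pid):
        return dict(lda.get_term_topics(dictionary.token2id[pid],
                                        minimum_probability=0.0))

    # <u,p> score = inner product of the user and product topic vectors
    def score(uid, pid):
        pt = product_topics(pid)
        return sum(user_topics[uid].get(k, 0.0) * v for k, v in pt.items())

    return user_topics, score
```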
WORD2VEC
- similar to LDA, but only at the product level (a sketch follows)
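A minimal sketch, assuming each order is treated as a "sentence" of product ids (a common choice for this competition; the repo may segment the histories differently):

```python
from gensim.models import Word2Vec

# order_baskets: list of lists of product-id strings, one list per order
def train_product2vec(order_baskets, dim=100):
    # gensim < 4 uses `size` instead of `vector_size`
    model = Word2Vec(sentences=order_baskets, vector_size=dim,
                     window=5, min_count=2, workers=4)
    return model  # model.wv['24852'] is the embedding of product 24852
```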
LSTM
- the intervals between user u's successive purchases of product p are modeled as a time series
- an LSTM regression model predicts the next value of this series
- the predicted next interval serves as a good feature (a sketch follows)
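A minimal sketch of such an interval model in Keras; the actual framework, shapes, and hyperparameters in the repo may differ:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_interval_lstm(seq_len=5):
    model = keras.Sequential([
        layers.LSTM(32, input_shape=(seq_len, 1)),  # sequence of past intervals
        layers.Dense(1),                            # regression: next interval in days
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# X has shape (n_samples, seq_len, 1): past intervals per (user, product) pair;
# y has shape (n_samples,): the next interval to predict.
# model = build_interval_lstm(); model.fit(X, y, epochs=10, batch_size=256)
```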
DREAM
- a model based on an RNN and Bayesian Personalized Ranking (BPR); refer to this repo for my implementation
- DREAM provides <u,p> scores, dynamic user representations, and item embeddings
- it captures sequential information, such as periodicity, in users' prior purchase history (the core ranking loss is sketched below)
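For reference, the BPR ranking loss at the heart of DREAM looks roughly like this in PyTorch (a sketch of the idea, not the linked repo's code):

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    # scores are inner products, as in the <u,p> features above
    pos = (user_vec * pos_item_vec).sum(dim=-1)
    neg = (user_vec * neg_item_vec).sum(dim=-1)
    # a bought item should score higher than a sampled negative item
    return -F.logsigmoid(pos - neg).mean()
```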
I built both LightGBM and XGBoost models.
I used Bayesian optimization to tune my LightGBM model (see the sketch below).
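A hedged sketch of the tuning loop in the spirit of bayes_optim_lgb.py, using the bayes_opt package; the search space and metric here are illustrative:

```python
import lightgbm as lgb
from bayes_opt import BayesianOptimization

def tune(X, y):
    dtrain = lgb.Dataset(X, label=y)

    def lgb_cv(num_leaves, feature_fraction, bagging_fraction):
        params = {
            'objective': 'binary', 'metric': 'auc',
            'num_leaves': int(num_leaves),
            'feature_fraction': feature_fraction,
            'bagging_fraction': bagging_fraction,
        }
        result = lgb.cv(params, dtrain, num_boost_round=500, nfold=5)
        # newer LightGBM versions name this key 'valid auc-mean'
        return max(result['auc-mean'])

    opt = BayesianOptimization(f=lgb_cv, pbounds={
        'num_leaves': (31, 255),
        'feature_fraction': (0.5, 1.0),
        'bagging_fraction': (0.5, 1.0),
    })
    opt.maximize(init_points=5, n_iter=25)
    return opt.max  # best CV score and the parameters that achieved it
```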
I used this script to construct orders from the <u,p> probabilities. Thanks to Faron, shing, and tarbox!
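The linked script computes the expected F1 of each predicted order exactly; as a simpler illustration of the idea, this sketch greedily picks the top-k products per order under an independence approximation:

```python
def select_products(probs):
    """probs: list of (product_id, P(u,p)) pairs for one future order."""
    ranked = sorted(probs, key=lambda x: x[1], reverse=True)
    total = sum(p for _, p in ranked)  # expected count of reordered products
    if total == 0:
        return []
    best_k, best_f1, exp_tp = 0, 0.0, 0.0
    for k, (_, p) in enumerate(ranked, start=1):
        exp_tp += p  # expected true positives among the top k
        precision, recall = exp_tp / k, exp_tp / total
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_k = f1, k
    # the real script also weighs predicting 'None' (no reorders at all)
    return [pid for pid, _ in ranked[:best_k]]
```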
I trained big models (500+ features), medium models (260+ features), and small models (80+ features). Final submissions were generated by bagging the top models using the median (sketched below).
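A minimal sketch of the median-bagging step over per-model prediction files; the file layout and column names (order_id, product_id, prob) are assumptions:

```python
import pandas as pd

def bag_predictions(csv_paths):
    # each CSV is assumed to hold one model's <u,p> probabilities
    preds = [pd.read_csv(path).set_index(['order_id', 'product_id'])['prob']
             for path in csv_paths]
    stacked = pd.concat(preds, axis=1)          # align predictions by (order, product)
    return stacked.median(axis=1).rename('prob').reset_index()
```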
Python Files
bayes_optim_lgb.py
- LightGBM model tuned by Bayesian optimization
lgb_cv.py
- LightGBM model with 5-fold CV
xgb_train_eval_test.py
- XGBoost model
transactions.py
- craft features manually from the raw transaction log / user purchase history
feats.py
- combine all features and make train/test dataset
inference.py
- construct orders using P(u,p)
evaluation.py
- some functions related to local evaluation
constants.py
- some constants, such as file paths
utils.py
- some useful functions
Jupyter Notebooks
EDA and Feat Craft
- dataset exploration and feature crafting
Evaluation and Bagging
- local evaluation and bagging models
Submission and Bagging
- generate submissions
Copyright (c) 2017 Yihong Chen