I made two models: one for predicting reorders and one for predicting None. The features I made are listed below.
User features:
- How often the user reordered items
- Time between orders
- Time of day the user visits
- Whether the user ordered organic, gluten-free, or Asian items in the past
- Features based on order sizes
- How many of the user's orders contained no previously purchased items
Item features:
- How often the item is purchased
- Position in the cart
- How many users buy it as a "one shot" item (purchased only once)
- Stats on the number of items that co-occur with this item
- Stats on the order streak
- Probability of being reordered within N orders
- Distribution of the day of week it is ordered
- Probability it is reordered after the first order
- Statistics around the time between orders
User × Item features:
- Number of orders in which the user purchases the item
- Days since the user last purchased the item
- Streak (number of orders in a row the user has purchased the item)
- Position in the cart
- Whether the user already ordered the item today
- Co-occurrence statistics
- Replacement items
Datetime features:
- Counts by day of week
- Counts by hour
For more detail, please refer to the code; a small sketch of how such features can be computed is shown below.
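As an illustration only, here is a minimal pandas sketch of two of the features above ("number of orders in which the user purchases the item" and "how often the user reordered items"). The long-format log and its column names (user_id, order_number, item_id) are assumptions of mine, not the repo's actual schema; the real feature scripts live under py_feature.

```python
import pandas as pd

# Hypothetical order log: one row per (user, order, item).
# Column names are assumptions for illustration only.
log = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 3, 1, 2],
    "item_id":      ["A", "A", "A", "B", "A", "B"],
})

# "Number of orders in which the user purchases the item"
n_orders = (log.groupby(["user_id", "item_id"])["order_number"]
               .nunique()
               .rename("user_item_n_orders"))

# "How often the user reordered items": share of a user's item purchases
# that repeat something the user already bought in an earlier order.
log = log.sort_values(["user_id", "order_number"])
log["is_reorder"] = log.duplicated(["user_id", "item_id"])
reorder_rate = (log.groupby("user_id")["is_reorder"]
                   .mean()
                   .rename("user_reorder_rate"))

print(n_orders)
print(reorder_rate)
```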
Regarding F1 maximization, I hadn't read that paper until Faron published the kernel, but I got a high score because of my own F1 maximization. Let me explain it.

To maximize F1, I generate y_true samples according to the predicted probabilities, then check F1 starting from the highest-probability items. For example, say an order has items with probabilities like {A: 0.3, B: 0.5, C: 0.4}. I generate y_true many times (in my case, 9999 times), so we end up with many samples of y_true, like [[A,B], [B], [B,C], [C], [B], [None], ...]. As mentioned above, the next step is to check the expected F1 of [B], then [B,C], then [B,C,A]. As soon as the estimated F1 peaks, we can stop the calculation and move on to the next order. Note that with this method we don't need to check every pattern, like [A], [A,B], [A,B,C], [B], and so on. I guess some might have figured out this method from my comment about "tips to go farther".

However, this method is time-consuming and depends on the random seed, so in the end I used Faron's kernel. Fortunately or not, I got almost the same result with Faron's kernel. Please refer to py_model/pyx_get_best_items.pyx.
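Here is a minimal sketch of that Monte Carlo procedure, assuming Python 3 and NumPy. It ignores the possibility of predicting None for simplicity, and the function names are mine, not the ones used in pyx_get_best_items.pyx:

```python
import numpy as np

def f1(pred, true):
    """Plain set F1; both sets empty counts as a perfect match."""
    if not pred and not true:
        return 1.0
    tp = len(set(pred) & set(true))
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

def best_subset(probs, n_sim=9999, seed=71):
    """probs: dict item -> predicted reorder probability,
    e.g. {'A': 0.3, 'B': 0.5, 'C': 0.4}."""
    rng = np.random.RandomState(seed)
    items = sorted(probs, key=probs.get, reverse=True)   # [B, C, A]
    p = np.array([probs[i] for i in items])

    # Sample n_sim plausible y_true sets: each item is drawn independently
    # with its predicted probability (an all-False row is a 'None' order).
    draws = rng.random_sample((n_sim, len(items))) < p
    y_trues = [[items[j] for j in np.flatnonzero(row)] for row in draws]

    # Grow the prediction in decreasing-probability order: [B], [B,C], [B,C,A].
    best_f1, best_pred = -1.0, []
    for k in range(1, len(items) + 1):
        pred = items[:k]
        exp_f1 = np.mean([f1(pred, t) for t in y_trues])
        if exp_f1 <= best_f1:
            break      # estimated F1 has peaked: stop and go to the next order
        best_f1, best_pred = exp_f1, pred
    return best_pred, best_f1

print(best_subset({'A': 0.3, 'B': 0.5, 'C': 0.4}))
```

The early stop relies on the same observation as in the text: once the estimated F1 stops improving while items are added in decreasing-probability order, it has peaked, so the remaining patterns need not be checked.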
How to run:
- cd py_feature
- python 901_run_feature.py
- python 902_run_concat.py
- cd ../py_model
- python 999_run.py
Around 300 GB of RAM is needed (sorry). But I confirmed that we can get 0.4073 on the private LB with only around 60 GB of RAM. Also, if you don't have enough memory and want to get a high score, try continuous training using the xgb_model argument of xgb.train, as sketched below.
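For example, such a continued-training loop might look like the sketch below. The chunk file names and parameters are hypothetical; the relevant piece is the xgb_model argument of xgb.train, which resumes boosting from an existing booster instead of starting from scratch:

```python
import xgboost as xgb

# Hypothetical params and pre-split chunks of the training data that each
# fit in memory on their own; only xgb_model is the point of this sketch.
params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 10}

booster = None
for part in ["train_part1.libsvm", "train_part2.libsvm"]:
    dtrain = xgb.DMatrix(part)              # load one chunk at a time
    booster = xgb.train(params, dtrain, num_boost_round=500,
                        xgb_model=booster)  # continue from the previous booster

booster.save_model("model.bin")
```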
Python packages:
- numpy==1.12.1
- pandas==0.19.2
- scipy==0.19.0
- tqdm==4.11.2
- xgboost==0.6