The solution pipeline consists of:
- Generating Covisitation Matrices
- Splitting the Data as Train/Val
- Generating Co-Occurrence Matrices
- Candidate Generation
- Feature Extraction
- Training
- Inference
Generated the top-100 AIDs for each of the three well-known covisitation schemes given in: link
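A minimal sketch of how a covisitation matrix of this kind can be built, assuming each session is a list of `(aid, ts)` events with timestamps in seconds. The function name `build_covisitation`, the one-day window, and the unweighted counting are illustrative assumptions; they are not the three schemes referenced above.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_covisitation(sessions, top_n=100, max_gap=24 * 60 * 60):
    """Return, for each AID, its `top_n` most frequently co-visited AIDs.

    Two AIDs co-visit when they occur in the same session within
    `max_gap` seconds of each other (an assumed window).
    """
    pair_counts = defaultdict(Counter)
    for events in sessions.values():
        for (aid_a, ts_a), (aid_b, ts_b) in combinations(events, 2):
            if aid_a != aid_b and abs(ts_a - ts_b) <= max_gap:
                pair_counts[aid_a][aid_b] += 1
                pair_counts[aid_b][aid_a] += 1
    return {
        aid: [other for other, _ in counts.most_common(top_n)]
        for aid, counts in pair_counts.items()
    }


# Toy data: session id -> list of (aid, ts)
sessions = {
    1: [(10, 0), (11, 60), (12, 120)],
    2: [(10, 0), (11, 30)],
    3: [(11, 0), (12, 10)],
}
covisits = build_covisitation(sessions, top_n=2)
print(covisits[10])
```

The pairwise loop is quadratic in session length, so a real run over the full data would need truncated sessions or chunked processing.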
Used the local validation scheme given by Radek. The local train data was also split into two halves by session to avoid possible leakage between model training and score calculation. The implementation can be seen in the corresponding notebook.
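The split by session can be sketched as follows. This is an illustration only: the helper `split_sessions`, the random shuffle, and the fixed seed are assumptions, and the actual notebook may split differently (for example by session ID order).

```python
import random


def split_sessions(session_ids, seed=42):
    """Split unique session IDs into two disjoint halves."""
    ids = sorted(set(session_ids))
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return set(ids[:mid]), set(ids[mid:])


# Toy events: (session, aid)
events = [(s, aid) for s in range(10) for aid in (100 + s, 200 + s)]
half_a, half_b = split_sessions(s for s, _ in events)

# Every event of a session lands in exactly one half,
# so no session is shared between model training and scoring.
train_events = [e for e in events if e[0] in half_a]
valid_events = [e for e in events if e[0] in half_b]
```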
Counted the co-occurrences of every AID pair across all sessions, separately for each action pair (click-cart, cart-order, etc.). These counts are used later for feature extraction.
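A rough sketch of such pair counting, assuming each session is a list of `(aid, action)` events and that a pair is counted at most once per session. The key layout `(action_a, action_b, aid_a, aid_b)` and the name `cooccurrence_counts` are hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sessions):
    """Count sessions in which aid_a had action_a and aid_b had action_b."""
    counts = Counter()
    for events in sessions.values():
        unique_events = set(events)  # deduplicate (aid, action) per session
        for (aid_a, act_a), (aid_b, act_b) in product(unique_events, repeat=2):
            if aid_a != aid_b:
                counts[(act_a, act_b, aid_a, aid_b)] += 1
    return counts


# Toy data: session id -> list of (aid, action)
sessions = {
    1: [(10, "clicks"), (11, "carts"), (11, "orders")],
    2: [(10, "clicks"), (11, "carts")],
}
counts = cooccurrence_counts(sessions)
print(counts[("clicks", "carts", 10, 11)])
```

Storing every pair for every action pair grows quickly, so on the full data the counts would likely have to be pruned or computed in chunks.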
Used the public candidate generation script to generate 100 candidates for each action type.
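The public script is not reproduced here. The sketch below only illustrates a common pattern for this step, under the assumption that candidates are the session's own items (most recent first) followed by the items most often co-visited with them; `generate_candidates` and its voting rule are hypothetical.

```python
from collections import Counter


def generate_candidates(history, covisits, n=100):
    """Return up to `n` candidate AIDs for one session.

    history:  AIDs of the session in chronological order
    covisits: dict mapping an AID to its list of co-visited AIDs
    """
    # Session items first, most recent first, without duplicates
    candidates = list(dict.fromkeys(reversed(history)))
    seen = set(candidates)

    # Each history item votes for its co-visited items
    votes = Counter()
    for aid in candidates:
        for other in covisits.get(aid, []):
            if other not in seen:
                votes[other] += 1

    candidates.extend(aid for aid, _ in votes.most_common(n))
    return candidates[:n]


covisits = {10: [11, 12], 11: [12, 13]}
print(generate_candidates([10, 11, 10], covisits, n=3))
```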
Generated features for the following data subsets:
- Items
  - Statistics generated from hour, weekday and weekend status
  - Count features (bool for >0 and >1, rank among all)
  - Unique count features (unique count and rank among all)
  - Distribution of action types in percentiles
  - Inclusion rate by all sessions
  - Occurrence rate in the last week of data
  - Average number of times seen in the same sessions at different times
  - All of the above, computed separately for each action type
- Sessions
  - Statistics generated from hour, weekday and weekend status
  - Count features (bool for >0 and >1, rank among all)
  - Unique count features (unique count and rank among all)
  - Distribution of action types in percentiles
  - Length of the session
  - Features generated by extracting mini-sessions according to the time differences between actions
  - Statistics generated from multiple purchases made in a single basket
  - Rates of taking products to the next action within the same session (click->cart, cart->order)
  - All of the above, computed separately for each action type
- Item-Session Combinations
  - Statistics generated from hour, weekday and weekend status
  - Count features (bool for >0 and >1, rank among all)
  - Unique count features (unique count and rank among all)
  - Distribution of action types in percentiles
  - Reversed order of the item in the session
  - Time difference between the latest occurrence of the item and the start and end of the session
- Covisitation and Co-Occurrence Statistics
  - Statistics generated from covisitation and co-occurrence scores between candidate items and items in the session's history
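As an illustration of a few of the features listed above (reversed order, time differences to the session start and end, and mini-sessions), here is a small sketch. It assumes one session's events are `(aid, ts)` tuples sorted by timestamp in seconds; the function names, the feature names, and the 30-minute gap threshold are assumptions made for the example.

```python
def item_session_features(events, aid):
    """Features for one (session, candidate item) pair, or None if unseen."""
    positions = [i for i, (a, _) in enumerate(events) if a == aid]
    if not positions:
        return None
    last = positions[-1]
    last_ts = events[last][1]
    return {
        "count": len(positions),
        # 0 means the item was the most recent event in the session
        "reversed_order": len(events) - 1 - last,
        "secs_since_session_start": last_ts - events[0][1],
        "secs_until_session_end": events[-1][1] - last_ts,
    }


def count_mini_sessions(events, max_gap=30 * 60):
    """Number of activity bursts separated by gaps longer than `max_gap`."""
    if not events:
        return 0
    gaps = (b[1] - a[1] for a, b in zip(events, events[1:]))
    return 1 + sum(gap > max_gap for gap in gaps)


events = [(10, 0), (11, 100), (10, 5000), (12, 5100)]
feats = item_session_features(events, 10)
n_mini = count_mini_sessions(events)
print(feats, n_mini)
```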
Used the following config:
- Model: XGBoost
- Fold Scheme: 5-Fold (Grouped by "session")
- Negative Sampling Fraction: 15%
- Dropped sessions with no positive labels
- Used the first half of the split local training set
- Used mean blending
- Executed on the second half of the split local training set when running local validation
- Weekday-Specific aggregations
- Word2Vec features
- Different models (CatBoost, LGBM)
- Comprehensive pair scores (because of OOM errors)
- Max-median blending
- Early-stopping
- Higher negative fractions
- Different objective metrics
- Different fold counts
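The data preparation behind the training config listed earlier (sessions without positive labels dropped, 15% negative sampling, folds grouped by session) and the mean blending can be sketched as below. The XGBoost model itself is not shown, and the helper names, the per-row sampling rule, and the round-robin fold assignment are assumptions rather than the original implementation.

```python
import random


def prepare_training_rows(rows, neg_fraction=0.15, seed=0):
    """rows: list of (session, aid, label).

    Drops sessions with no positive label, keeps every positive row and
    keeps each negative row with probability `neg_fraction`.
    """
    rng = random.Random(seed)
    positive_sessions = {s for s, _, y in rows if y == 1}
    return [
        (s, aid, y)
        for s, aid, y in rows
        if s in positive_sessions and (y == 1 or rng.random() < neg_fraction)
    ]


def grouped_folds(rows, n_folds=5):
    """Give every row a fold index so that a session never spans two folds."""
    sessions = sorted({s for s, _, _ in rows})
    fold_of = {s: i % n_folds for i, s in enumerate(sessions)}
    return [fold_of[s] for s, _, _ in rows]


def mean_blend(fold_predictions):
    """Average the per-candidate scores predicted by the fold models."""
    return [sum(scores) / len(scores) for scores in zip(*fold_predictions)]


# Toy data: sessions 0-4 contain one positive row each, sessions 5-9 none
rows = [(s, aid, int(s < 5 and aid == 0)) for s in range(10) for aid in range(20)]
train_rows = prepare_training_rows(rows)
folds = grouped_folds(train_rows)
blended = mean_blend([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Grouping folds by session keeps all candidates of a session on the same side of each split, which matters because the ranker is evaluated per session.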