Our solution achieved 1st place on the private leaderboard.
We split the train data into 6 folds, giving about 3000 pairs per fold, which is similar in size to the contest validation and test sets.
Final validation quality (with ensemble):

metric | mean | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | fold 5
---|---|---|---|---|---|---|---
pre | 0.5425 | 0.549 | 0.533 | 0.546 | 0.541 | 0.541 | 0.545
mrr | 0.2028 | 0.202 | 0.204 | 0.204 | 0.202 | 0.202 | 0.203
r1 | 0.2952 | 0.296 | 0.295 | 0.297 | 0.294 | 0.294 | 0.295

Final validation quality (with single model):

metric | mean | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | fold 5
---|---|---|---|---|---|---|---
pre | 0.5025 | 0.514 | 0.497 | 0.506 | 0.496 | 0.498 | 0.504
mrr | 0.1947 | 0.194 | 0.198 | 0.194 | 0.194 | 0.193 | 0.195
r1 | 0.2808 | 0.282 | 0.283 | 0.281 | 0.279 | 0.279 | 0.281
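For reference, a minimal sketch of how MRR and R1 can be computed from a pairwise distance matrix (our own illustration, assuming the true match for bank user `i` is rtk user `i`; the official contest metrics may differ in details):

```python
import numpy as np

def mrr_and_r1(dist):
    """dist[i, j] = distance between bank user i and rtk user j.
    The true match for bank user i is assumed to be rtk user j = i."""
    n = dist.shape[0]
    true_dist = dist[np.arange(n), np.arange(n)][:, None]
    # 1-based rank of the true match among all candidates, by distance
    ranks = (dist < true_dist).sum(axis=1) + 1
    mrr = (1.0 / ranks).mean()   # mean reciprocal rank
    r1 = (ranks == 1).mean()     # fraction matched at rank 1
    return mrr, r1

dist = np.array([[0.1, 0.9],
                 [0.8, 0.2]])
print(mrr_and_r1(dist))  # perfect matching -> (1.0, 1.0)
```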
This is a neural-network-based solution. We use the pytorch-lifestream library to simplify work with sequential data.
We build a neural network that encodes transaction sequences and clickstreams into vector representations. Matching and ranking are based on the L2 distance between the transaction-sequence embedding and the clickstream embedding.
The network includes two stages:
- TrxEncoder, which transforms each individual transaction or click into a vector representation
- SequenceEncoder, which reduces the sequence of vectors from TrxEncoder to one final vector
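The two-stage design can be sketched in plain PyTorch (an illustration only; the real solution uses pytorch-lifestream components, and the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

class TrxEncoder(nn.Module):
    """Embeds categorical features and concatenates normalised numerical ones."""
    def __init__(self, n_cat, cat_dim=16, n_num=1):
        super().__init__()
        self.emb = nn.Embedding(n_cat, cat_dim)
        self.out_dim = cat_dim + n_num

    def forward(self, cat, num):
        # cat: (B, T) integer codes, num: (B, T, n_num) normalised values
        return torch.cat([self.emb(cat), num], dim=-1)  # (B, T, out_dim)

class SequenceEncoder(nn.Module):
    """Reduces a sequence of transaction vectors to one embedding via an RNN."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, x):
        _, h = self.rnn(x)   # h: (1, B, hidden)
        return h.squeeze(0)  # (B, hidden)

trx_enc = TrxEncoder(n_cat=100)
seq_enc = SequenceEncoder(trx_enc.out_dim)
cat = torch.randint(0, 100, (4, 20))
num = torch.randn(4, 20, 1)
emb = seq_enc(trx_enc(cat, num))
print(emb.shape)  # torch.Size([4, 64])
```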
A training batch contains 128 pairs of bank-rtk users. A short transaction subsequence is sampled twice from each bank user, and likewise for the clickstream, so we get 256 transaction-sequence samples for 128 bank users and 256 click-sequence samples for 128 rtk users. We then build a 256x256 matrix of pairwise L2 distances and a matrix of matched/unmatched labels. The labels are used to sample positive and negative pairs for the Softmax Loss.
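The core of this step can be sketched as a softmax loss over the pairwise L2 distance matrix (a simplification: here each row has exactly one positive, whereas the actual implementation samples positives and negatives from the label matrix):

```python
import torch
import torch.nn.functional as F

def softmax_matching_loss(trx_emb, click_emb, labels):
    """trx_emb: (N, d), click_emb: (M, d);
    labels: (N, M) bool matrix, True where the pair is matched."""
    dist = torch.cdist(trx_emb, click_emb, p=2)  # (N, M) pairwise L2 distances
    logits = -dist                               # smaller distance -> higher score
    # softmax over click columns for each trx row; matched column is the target
    target = labels.float().argmax(dim=1)
    return F.cross_entropy(logits, target)

trx = F.normalize(torch.randn(256, 64), dim=1)
clk = F.normalize(torch.randn(256, 64), dim=1)
labels = torch.eye(256, dtype=torch.bool)
loss = softmax_matching_loss(trx, clk, labels)
print(loss.item())  # close to log(256) for random embeddings
```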
SequenceEncoder is an RNN, shared between transactions and clicks; its weights are initialised randomly. TrxEncoder is a mix of embeddings for categorical features and normalised numerical features. We build one TrxEncoder for transactions and one for clicks, and pretrain the weights of both.
The pretraining was inspired by the BERT architecture. We use TrxEncoder + TransformerEncoder with an MLM (Masked Language Model) task, where the TransformerEncoder predicts the masked TrxEncoder vectors.
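A hedged sketch of this MLM-style pretraining step (mask random positions in the TrxEncoder output and train a TransformerEncoder to reconstruct them; the masking scheme and loss here are our own simplification):

```python
import torch
import torch.nn as nn

def mlm_step(trx_vectors, transformer, mask_prob=0.15):
    """trx_vectors: (B, T, d) output of a TrxEncoder."""
    mask = torch.rand(trx_vectors.shape[:2]) < mask_prob  # positions to hide
    inp = trx_vectors.clone()
    inp[mask] = 0.0                   # replace masked vectors with zeros
    pred = transformer(inp)           # (B, T, d) reconstruction
    # regression loss on the masked positions only
    return nn.functional.mse_loss(pred[mask], trx_vectors[mask])

d = 32
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
loss = mlm_step(torch.randn(8, 50, d), transformer)
print(loss.item())
```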
We use ensembling in the final solution: we train 11 models and average the pairwise distances from their outputs.
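Averaging pairwise distances over models can be sketched as:

```python
import numpy as np

def ensemble_distances(dist_matrices):
    """dist_matrices: list of (N, M) distance arrays, one per trained model."""
    return np.mean(np.stack(dist_matrices), axis=0)

d1 = np.array([[0.2, 0.8], [0.9, 0.1]])
d2 = np.array([[0.4, 0.6], [0.7, 0.3]])
print(ensemble_distances([d1, d2]))  # [[0.3, 0.7], [0.8, 0.2]]
```

Averaging distances rather than rankings keeps the ensemble output in the same form as a single model's output, so the downstream matching step is unchanged.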
We used a Tesla V100 with 32 GB. The same settings also work with 16 GB.
Stage | Time
---|---
Full train time with 5 models in ensemble | 142 min
- Data load and preprocessing | 8 min
- Pretrain TrxEncoder (both trx and click) | 19 min
- Train one SeqEncoder (there are 5 in ensemble) | 23 min
Full inference time (2930 bank uids, 2445 rtk uids) | 2 min
pipenv sync --dev
pipenv shell
# Load data
sh get_data.sh
# Split data
python src/split_folds.py
cd src/
# Pipeline with validation; valid_fold_id from [0, 1, 2, 3, 4, 5] should be provided as a parameter
python nn_train.py valid_fold_id=4
python nn_inference.py valid_fold_id=4
python nn_evaluation.py valid_fold_id=4
# Pipeline without validation: train the model before submitting
python nn_distance_train.py valid_fold_id=None
# There is no data left for validation, since all data was used for training
# python nn_distance_inference.py