Kaggle Project: https://www.kaggle.com/c/outbrain-click-prediction
Data source: https://www.kaggle.com/c/outbrain-click-prediction/data
Overview:
- Basic classifier (SVM, FTRL) on basic features
- Feature engineering:
  - Document-wise feature construction (TF-IDF)
  - Generating categorical features
  - Feature selection
- Train Random Forest and Gradient Boosting tree models on mean target value (MTV) features
- Field-aware Factorization Machine (FFM) model
- Ensembling with a Gradient Boosting tree
Files description:
0_prepare_splits.py
splits the training dataset into two folds: one for training, the other for validation
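A minimal sketch of such a split, assuming the competition's clicks_train.csv and grouping by display_id so a display never straddles folds (the script's exact logic and output file names may differ):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Sketch only: split clicks_train.csv into two folds, keeping all ads of
# one display_id in the same fold. Output file names are assumptions.
df = pd.read_csv('clicks_train.csv')

splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, valid_idx = next(splitter.split(df, groups=df['display_id']))

df.iloc[train_idx].to_csv('fold_0.csv', index=False)
df.iloc[valid_idx].to_csv('fold_1.csv', index=False)
```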
1_svm_data.py
processes basic features for SVM by reading two files: events.csv and promoted_content.csv
ad_display_str = [uuid, document_id, platform, dow, hour, dow_hour, geo_location]
e.g. 1,42337,0,0,addoc_938164 campaign_5969 adv_1499 u_cb8c55702adb93 d_379743 p_3 dow_1 hour_4 dow_hour_1_4 US US_SC US_SC_519
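A minimal sketch of how one such feature string can be assembled (the real script streams events.csv and promoted_content.csv; the helper below is hypothetical):

```python
from types import SimpleNamespace

# Sketch only: build one (display, ad) feature string like the example above.
def feature_string(event, ad, dow, hour):
    geo = str(event.geo_location).split('>')          # e.g. 'US>SC>519'
    geo_tokens = ['_'.join(geo[:i + 1]) for i in range(len(geo))]
    return ' '.join([
        f'addoc_{ad.document_id}', f'campaign_{ad.campaign_id}',
        f'adv_{ad.advertiser_id}', f'u_{event.uuid}',
        f'd_{event.document_id}', f'p_{event.platform}',
        f'dow_{dow}', f'hour_{hour}', f'dow_hour_{dow}_{hour}',
    ] + geo_tokens)

event = SimpleNamespace(uuid='cb8c55702adb93', document_id=379743,
                        platform=3, geo_location='US>SC>519')
ad = SimpleNamespace(document_id=938164, campaign_id=5969, advertiser_id=1499)
print(feature_string(event, ad, dow=1, hour=4))
# addoc_938164 campaign_5969 adv_1499 u_cb8c55702adb93 d_379743 p_3 dow_1 hour_4 dow_hour_1_4 US US_SC US_SC_519
```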
2_train_svm.py
trains an SVM model on the features generated by 1_svm_data.py, with 2-fold CV and AUC, precision, F1, etc. as metrics.
Time & result: building the train matrix took 35.4096 min; with C=0.1, training took 14324.763 s, auc=0.734, prec=0.600, f1=0.164
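A hedged sketch of the training step, assuming the feature strings and 0/1 labels are already loaded; the hashing dimensionality and exact pipeline are assumptions:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score, f1_score

# Sketch only: hash the space-separated feature strings into a sparse matrix.
vectorizer = HashingVectorizer(n_features=2**22, binary=True, norm=None)
X_train = vectorizer.transform(train_texts)   # train_texts: feature strings from 1_svm_data.py
X_valid = vectorizer.transform(valid_texts)

model = LinearSVC(C=0.1)                      # C=0.1 as reported above
model.fit(X_train, y_train)

scores = model.decision_function(X_valid)
print('auc =', roc_auc_score(y_valid, scores))
print('f1  =', f1_score(y_valid, scores > 0))
```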
3_doc_similarity_features.py
calculates the TF-IDF similarity between the document the user is on and the ad's landing document
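A minimal sketch of the similarity computation, assuming each document is represented by a text of its topic/category/entity tokens (the token scheme here is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch only: two documents as token texts; the real script builds these
# from the competition's document metadata files.
docs = {
    379743: 'topic_12 topic_44 cat_1403 entity_ab12',   # document the user is on
    938164: 'topic_12 topic_9 cat_1403 entity_cd34',    # ad's landing document
}
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs.values())

# Cosine similarity between the user's document and the ad document.
sim = cosine_similarity(matrix[0], matrix[1])[0, 0]
print('tfidf_similarity =', sim)
```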
4_categorical_data_join.py and 4_categorical_data_unwrap_columnwise.py
prepare the data for the mean target value feature calculation
4_mean_target_value.py
calculates the mean target value for all features listed in categorical_features.txt
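A minimal sketch of MTV encoding with smoothing; the smoothing constant, the target column name, and the helper itself are assumptions:

```python
import pandas as pd

# Sketch only: smoothed mean target value encoding for one categorical column.
def mean_target_value(train, valid, col, target='clicked', alpha=10):
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(['sum', 'count'])
    # Shrink rare categories toward the global mean.
    mtv = (stats['sum'] + alpha * global_mean) / (stats['count'] + alpha)
    return valid[col].map(mtv).fillna(global_mean)
```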
5_best_mtv_features_xgb.py
builds an XGBoost (XGB) model on a small part of the data and selects the best features based on information gain
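A minimal sketch of gain-based selection with XGBoost; the parameters and the cutoff of 50 features are assumptions:

```python
import xgboost as xgb

# Sketch only: fit on a small sample, then rank features by total gain.
dtrain = xgb.DMatrix(X_sample, label=y_sample)  # small subset of the MTV data
booster = xgb.train({'objective': 'binary:logistic', 'max_depth': 6},
                    dtrain, num_boost_round=100)

gain = booster.get_score(importance_type='gain')
best_features = sorted(gain, key=gain.get, reverse=True)[:50]
```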
5_mtv_rf.py
trains a Random Forest model on the MTV features
5_mtv_xgb.py
trains an XGB model on the MTV features and creates leaf features to be used in the FFM
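A minimal sketch of how leaf indices can be extracted as categorical features for the FFM (model parameters are assumptions):

```python
import xgboost as xgb

# Sketch only: train on the MTV features, then read off leaf indices.
dtrain = xgb.DMatrix(X_mtv, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=50)

# pred_leaf=True returns, per row, the index of the leaf it lands in for
# each tree: one categorical feature per tree.
leaf_features = booster.predict(dtrain, pred_leaf=True)  # shape (n_rows, n_trees)
```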
6_1_generate_ffm_data.py
creates the input file to be read by libffm
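For reference, libffm expects one line per example in the form `label field:feature:value`; a hypothetical helper:

```python
# Sketch only: serialize one example in libffm's text format.
def ffm_line(label, features):
    """features: list of (field_idx, feature_idx, value) triples."""
    return str(label) + ' ' + ' '.join(f'{f}:{j}:{v}' for f, j, v in features)

print(ffm_line(1, [(0, 12, 1), (1, 345, 1), (2, 7, 1)]))
# -> 1 0:12:1 1:345:1 2:7:1
```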
6_2_split_ffm_to_subfolds.py
splits each fold into two subfolds (can't use the original folds because the leaf features are not transferable between folds)
6_3_run_ffm.sh
runs libffm for training FFM models
6_4_put_ffm_subfolds_together.py
puts FFM predictions from each fold/subfold together
7_ensemble_data_prep.py
puts all the features and model predictions together for ensembling
7_ensemble_xgb.py
trains the second-level XGB model on top of all these features
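A minimal sketch of the stacking step; the feature/prediction array names and parameters are assumptions:

```python
import numpy as np
import xgboost as xgb

# Sketch only: base-model predictions become columns next to the features.
X_level2 = np.column_stack([mtv_features, svm_pred, rf_pred, xgb_pred, ffm_pred])

dtrain = xgb.DMatrix(X_level2, label=y)
ensemble = xgb.train({'objective': 'binary:logistic', 'eta': 0.1},
                     dtrain, num_boost_round=200)
```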
8_gen_net_line.py
generates the (display+adid) - (ad_docid) - (display+adid) network for LINE input, with nodes mapped to integer indices
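A minimal sketch of emitting the edge list for LINE (node naming, output path, and edge weights are assumptions):

```python
# Sketch only: connect (display+adid) nodes to their ad_docid node.
node_index = {}

def index_of(name):
    # Map each node name to a stable integer index.
    return node_index.setdefault(name, len(node_index))

with open('tmp/line_edges.txt', 'w') as out:
    for (display_id, ad_id), ad_doc_id in pairs:  # pairs built from the data
        u = index_of(f'da_{display_id}_{ad_id}')
        v = index_of(f'doc_{ad_doc_id}')
        out.write(f'{u} {v} 1\n')  # edge: src dst weight
```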
8_line_classifiers.py
uses the LINE embedding feature vectors (in tmp/) to train other models