Kaggle Project: https://www.kaggle.com/c/outbrain-click-prediction
Data source: https://www.kaggle.com/c/outbrain-click-prediction/data
Overview:
- Basic classifier (SVM, FTRL) on basic features
- Feature engineering:
  - Document-wise feature construction (TF-IDF)
  - Generating categorical features
  - Feature selection
- Train Random Forest and Gradient Boosting tree models on mean target value (MTV) features
- Field-aware Factorization Machine (FFM) model
- Ensembling with a Gradient Boosting tree
Files description:
0_prepare_splits.py
splits the training dataset into two folds: one for training, the other for validation
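A minimal sketch of such a split, assuming the competition's clicks_train.csv and grouping by display_id so a display never straddles folds (the script's exact logic and output file names may differ):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Sketch only: split clicks_train.csv into two folds, keeping all ads of
# one display_id in the same fold. Output file names are assumptions.
df = pd.read_csv('clicks_train.csv')

splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, valid_idx = next(splitter.split(df, groups=df['display_id']))

df.iloc[train_idx].to_csv('fold_0.csv', index=False)
df.iloc[valid_idx].to_csv('fold_1.csv', index=False)
```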
1_svm_data.py
processes basic features for SVM by reading two files: events.csv and promoted_content.csv
ad_display_str = [uuid, document_id, platform, dow, hour, dow_hour, geo_location]
e.g. 1,42337,0,0,addoc_938164 campaign_5969 adv_1499 u_cb8c55702adb93 d_379743 p_3 dow_1 hour_4 dow_hour_1_4 US US_SC US_SC_519
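A minimal sketch of how one such feature string can be assembled (the real script streams events.csv and promoted_content.csv; the helper below is hypothetical):

```python
from types import SimpleNamespace

# Sketch only: build one (display, ad) feature string like the example above.
def feature_string(event, ad, dow, hour):
    geo = str(event.geo_location).split('>')          # e.g. 'US>SC>519'
    geo_tokens = ['_'.join(geo[:i + 1]) for i in range(len(geo))]
    return ' '.join([
        f'addoc_{ad.document_id}', f'campaign_{ad.campaign_id}',
        f'adv_{ad.advertiser_id}', f'u_{event.uuid}',
        f'd_{event.document_id}', f'p_{event.platform}',
        f'dow_{dow}', f'hour_{hour}', f'dow_hour_{dow}_{hour}',
    ] + geo_tokens)

event = SimpleNamespace(uuid='cb8c55702adb93', document_id=379743,
                        platform=3, geo_location='US>SC>519')
ad = SimpleNamespace(document_id=938164, campaign_id=5969, advertiser_id=1499)
print(feature_string(event, ad, dow=1, hour=4))
# addoc_938164 campaign_5969 adv_1499 u_cb8c55702adb93 d_379743 p_3 dow_1 hour_4 dow_hour_1_4 US US_SC US_SC_519
```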
2_train_svm.py
trains an SVM model on the features generated by 1_svm_data.py, with 2-fold CV and AUC, precision, F1, etc. as metrics.
Time & result: building the train matrix took 35.4096 min; with C=0.1, training took 14324.763 s, auc=0.734, prec=0.600, f1=0.164
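A hedged sketch of the training step, assuming the feature strings and 0/1 labels are already loaded; the hashing dimensionality and exact pipeline are assumptions:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score, f1_score

# Sketch only: hash the space-separated feature strings into a sparse matrix.
vectorizer = HashingVectorizer(n_features=2**22, binary=True, norm=None)
X_train = vectorizer.transform(train_texts)   # train_texts: feature strings from 1_svm_data.py
X_valid = vectorizer.transform(valid_texts)

model = LinearSVC(C=0.1)                      # C=0.1 as reported above
model.fit(X_train, y_train)

scores = model.decision_function(X_valid)
print('auc =', roc_auc_score(y_valid, scores))
print('f1  =', f1_score(y_valid, scores > 0))
```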
3_doc_similarity_features.py
calculates the TF-IDF similarity between the document the user is on and the ad's landing document
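A minimal sketch of the similarity computation, assuming each document is represented by a text of its topic/category/entity tokens (the token scheme here is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch only: two documents as token texts; the real script builds these
# from the competition's document metadata files.
docs = {
    379743: 'topic_12 topic_44 cat_1403 entity_ab12',   # document the user is on
    938164: 'topic_12 topic_9 cat_1403 entity_cd34',    # ad's landing document
}
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs.values())

# Cosine similarity between the user's document and the ad document.
sim = cosine_similarity(matrix[0], matrix[1])[0, 0]
print('tfidf_similarity =', sim)
```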
4_categorical_data_join.py and 4_categorical_data_unwrap_columnwise.py
prepare the data for the mean target value feature calculation
4_mean_target_value.py
calculates the mean target value for all features listed in categorical_features.txt
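A minimal sketch of MTV encoding with smoothing; the smoothing constant, the target column name, and the helper itself are assumptions:

```python
import pandas as pd

# Sketch only: smoothed mean target value encoding for one categorical column.
def mean_target_value(train, valid, col, target='clicked', alpha=10):
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(['sum', 'count'])
    # Shrink rare categories toward the global mean.
    mtv = (stats['sum'] + alpha * global_mean) / (stats['count'] + alpha)
    return valid[col].map(mtv).fillna(global_mean)
```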
5_best_mtv_features_xgb.py
builds an XGBoost (XGB) model on a small part of the data and selects the best features based on information gain
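A minimal sketch of gain-based selection with XGBoost; the parameters and the cutoff of 50 features are assumptions:

```python
import xgboost as xgb

# Sketch only: fit on a small sample, then rank features by total gain.
dtrain = xgb.DMatrix(X_sample, label=y_sample)  # small subset of the MTV data
booster = xgb.train({'objective': 'binary:logistic', 'max_depth': 6},
                    dtrain, num_boost_round=100)

gain = booster.get_score(importance_type='gain')
best_features = sorted(gain, key=gain.get, reverse=True)[:50]
```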
5_mtv_rf.py
trains a Random Forest model on the MTV features
5_mtv_xgb.py
trains an XGB model on the MTV features and creates leaf features to be used in the FFM
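A minimal sketch of how leaf indices can be extracted as categorical features for the FFM (model parameters are assumptions):

```python
import xgboost as xgb

# Sketch only: train on the MTV features, then read off leaf indices.
dtrain = xgb.DMatrix(X_mtv, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=50)

# pred_leaf=True returns, per row, the index of the leaf it lands in for
# each tree: one categorical feature per tree.
leaf_features = booster.predict(dtrain, pred_leaf=True)  # shape (n_rows, n_trees)
```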
6_1_generate_ffm_data.py
creates the input file to be read by libffm
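For reference, libffm expects one line per example in the form `label field:feature:value`; a hypothetical helper:

```python
# Sketch only: serialize one example in libffm's text format.
def ffm_line(label, features):
    """features: list of (field_idx, feature_idx, value) triples."""
    return str(label) + ' ' + ' '.join(f'{f}:{j}:{v}' for f, j, v in features)

print(ffm_line(1, [(0, 12, 1), (1, 345, 1), (2, 7, 1)]))
# -> 1 0:12:1 1:345:1 2:7:1
```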
6_2_split_ffm_to_subfolds.py
splits each fold into two subfolds (can't use the original folds because the leaf features are not transferable between folds)
6_3_run_ffm.sh
runs libffm for training FFM models
6_4_put_ffm_subfolds_together.py
puts FFM predictions from each fold/subfold together
7_ensemble_data_prep.py
puts all the features and model predictions together for ensembling
7_ensemble_xgb.py
trains the second-level XGB model on top of all these features
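A minimal sketch of the stacking step; the feature/prediction array names and parameters are assumptions:

```python
import numpy as np
import xgboost as xgb

# Sketch only: base-model predictions become columns next to the features.
X_level2 = np.column_stack([mtv_features, svm_pred, rf_pred, xgb_pred, ffm_pred])

dtrain = xgb.DMatrix(X_level2, label=y)
ensemble = xgb.train({'objective': 'binary:logistic', 'eta': 0.1},
                     dtrain, num_boost_round=200)
```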
8_gen_net_line.py
generates the (display+adid) - (ad_docid) - (display+adid) network for LINE input, with nodes mapped to integer indices
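A minimal sketch of emitting the edge list for LINE (node naming, output path, and edge weights are assumptions):

```python
# Sketch only: connect (display+adid) nodes to their ad_docid node.
node_index = {}

def index_of(name):
    # Map each node name to a stable integer index.
    return node_index.setdefault(name, len(node_index))

with open('tmp/line_edges.txt', 'w') as out:
    for (display_id, ad_id), ad_doc_id in pairs:  # pairs built from the data
        u = index_of(f'da_{display_id}_{ad_id}')
        v = index_of(f'doc_{ad_doc_id}')
        out.write(f'{u} {v} 1\n')  # edge: src dst weight
```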
8_line_classifiers.py
uses the LINE embedding feature vectors (in tmp/) to train other models