Git repo for Tencent Advertisement Algorithm Competition
cd ./Script
. prerequisite.sh
python3 input_generate.py
python3 input_split.py fine
python3 train_w2v.py creative 128
python3 train_w2v.py ad 128
python3 train_w2v.py advertiser 128
python3 train_w2v.py product 128
python3 train_w2v.py industry 64
python3 train_w2v.py product_category 64
python3 train_v2_age_final_resgru_multiInp.py 40 1024 100 1e-3
-
How to run training script
Syntax:
python3 train_v2_{some script name}.py 40 2048 100 1e-3
Argument:
- (Required, INT) target epoch to train to
- (Required, INT) batch size for training
- (Required, INT) maximal length of the input sequence; a smaller length allows training with a larger batch size
- (Required, FLOAT) learning rate for the Adam optimizer
- (Optional, INT) epoch to resume training from; if nothing is specified the model is trained from scratch
- (Optional, INT) training file to resume from; if nothing is specified the model is trained from scratch
- Example:
9, 2
indicates resuming training from epoch 9, file 2.
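The positional arguments above could be read at the top of a training script with a sketch like the following (the function and variable names here are illustrative, not the repo's actual code):

```python
import sys

def parse_train_args(argv):
    """Parse the positional CLI arguments described above.

    argv layout: target_epoch, batch_size, max_seq_len, learning_rate,
    [resume_epoch, resume_file] -- the last two are optional.
    """
    if len(argv) < 4:
        raise SystemExit(
            "usage: train.py EPOCH BATCH MAX_LEN LR [RESUME_EPOCH RESUME_FILE]"
        )
    return {
        "target_epoch": int(argv[0]),
        "batch_size": int(argv[1]),
        "max_seq_len": int(argv[2]),
        "learning_rate": float(argv[3]),
        # Optional resume point: epoch number and training-file index.
        "resume_epoch": int(argv[4]) if len(argv) > 4 else None,
        "resume_file": int(argv[5]) if len(argv) > 5 else None,
    }

if __name__ == "__main__":
    print(parse_train_args(sys.argv[1:]))
```

With `40 1024 100 1e-3` this yields a from-scratch run; appending `9 2` yields a resume from epoch 9, file 2.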
-
Training script inventory
|--Script
   |--data_loader_v2.py
   |
   |--clf_lstm.py       # Model based on stacked LSTM
   |--clf_gnmt.py       # Model based on GNMT (Google Neural Machine Translation)
   |--clf_tf_enc.py     # Model based on the encoder part of the Transformer
   |--clf_esim.py       # Model based on ESIM (Enhanced Sequential Inference Model)
   |--clf_pre_ln_tf.py  # Model based on the pre-Layer-Normalization Transformer
   |--clf_rcnn.py       # Model based on RCNN
   |--clf_final.py      # Model for the final submission
   |
   |--train_v2_age_final_resgru_multiInp.py
   |--train_v2_age_final_resgru_cnn_multiInp.py
   |--train_v2_age_final_preln_tf_multiInp.py
-
How to run training script
Syntax:
python3 train_{some script name}.py 0 10 256 100 1e-3 split
Argument:
- (Required, INT) 0 means training from scratch; a positive number means loading the checkpoint for that epoch and resuming training from there
- (Required, INT) number of epochs to train
- (Required, INT) batch size for training
- (Required, INT) maximal length of the input sequence; a smaller length allows training with a larger batch size
- (Required, FLOAT) learning rate for the Adam optimizer
- (Optional) If nothing is specified the model is trained on the unsplit files. If
python3 input_split.py fine
has been executed and a value is specified here, the model is trained on the list of split files.
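The first argument drives a simple resume rule; a minimal sketch of that branch follows (the checkpoint path pattern `model_artifact/clf_epoch{n}.pt` is hypothetical, not the repo's actual naming):

```python
import os

def resolve_start_epoch(start_arg, ckpt_dir="model_artifact"):
    """Return (start_epoch, checkpoint_path) from the first CLI argument.

    start_arg == 0 -> train from scratch, no checkpoint to load.
    start_arg  > 0 -> resume from that epoch's saved checkpoint.
    """
    start_epoch = int(start_arg)
    if start_epoch == 0:
        return 0, None
    # Hypothetical naming scheme for per-epoch checkpoint files.
    path = os.path.join(ckpt_dir, f"clf_epoch{start_epoch}.pt")
    if not os.path.exists(path):
        raise FileNotFoundError(f"no checkpoint for epoch {start_epoch}: {path}")
    return start_epoch, path
```

A training script would load the returned checkpoint (if any) before entering its epoch loop.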
-
Training script inventory
|--Script
   |--data_loader.py
   |
   |--multi_seq_lstm_classifier.py
   |--train_age_multi_seq_lstm_classifier.py
   |--train_gender_multi_seq_lstm_classifier.py
   |
   |--transformer_encoder_classifier.py
   |--train_age_transformer_encoder_classifier_with_creative.py
   |
   |--GNMT_classifier.py
   |--train_age_GNMT_classifier_with_creative.py
   |
   |--multi_seq_GNMT_classifier.py
   |--train_age_multi_seq_GNMT_classifier.py
- Step 1: run
cd ./Script
. prerequisite.sh
Note that if the instance has no public internet connection, download the train and test files manually and put them under ./Script. You should have the following files and directories after execution.
|--Script
|--train_artifact
|--user.csv
|--click_log.csv
|--ad.csv
|--test_artifact
|--click_log.csv
|--ad.csv
|--input_artifact
|--embed_artifact
|--model_artifact
|--output_artifact
- Step 2: run
python3 input_generate.py
python3 input_split.py
For machines with limited memory, replace the second line with
python3 input_split.py fine
You should have the following files after execution.
|--Script
|--input_artifact
|--train_idx_shuffle.npy
|--train_age.npy
|--train_gender.npy
|--train_creative_id_seq.pkl
|--train_ad_id_seq.pkl
|--train_advertiser_id_seq.pkl
|--train_product_id_seq.pkl
|--test_idx_shuffle.npy
|--test_creative_id_seq.pkl
|--test_ad_id_seq.pkl
|--test_advertiser_id_seq.pkl
|--test_product_id_seq.pkl
|--embed_artifact
|--embed_train_creative_id_seq.pkl
|--embed_train_ad_id_seq.pkl
|--embed_train_advertiser_id_seq.pkl
|--embed_train_product_id_seq.pkl
|--model_artifact
|--output_artifact
|--train_artifact
|--test_artifact
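The fine-grained split performed by input_split.py fine presumably shards the large training pickles so that training on a small-memory machine can stream one shard at a time; a stdlib-only sketch of that idea, with illustrative file names and shard counts:

```python
import math
import pickle

def split_sequences(seqs, n_shards):
    """Split one large list of sequences into n_shards roughly equal chunks."""
    shard_size = math.ceil(len(seqs) / n_shards)
    return [seqs[i:i + shard_size] for i in range(0, len(seqs), shard_size)]

def write_shards(seqs, stem, n_shards):
    """Write each chunk to '<stem>_<k>.pkl' (hypothetical naming scheme)."""
    paths = []
    for k, shard in enumerate(split_sequences(seqs, n_shards)):
        path = f"{stem}_{k}.pkl"
        with open(path, "wb") as f:
            pickle.dump(shard, f)
        paths.append(path)
    return paths
```

Training then iterates over the shard files in order instead of loading one monolithic pickle.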
- Step 3: run
python3 train_w2v.py creative 128
python3 train_w2v.py ad 128
python3 train_w2v.py advertiser 128
python3 train_w2v.py product 128
python3 train_w2v.py industry 64
python3 train_w2v.py product_category 64
You should have the following files after execution.
|--Script
|--embed_artifact
|--w2v_registry.json
|--wv_registry.json
|--creative_sg_embed_s256_{random token}
|--...
|--model_artifact
|--input_artifact
|--output_artifact
|--train_artifact
|--test_artifact
Note that w2v_registry.json stores the paths of all w2v model artifacts, and wv_registry.json stores the paths of all KeyedVector artifacts.
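A training script can recover the embedding artifacts through these registries; a minimal sketch of the lookup, assuming each registry is a JSON object mapping a field name (e.g. "creative") to an artifact path:

```python
import json

def load_registry(path):
    """Read a registry JSON file mapping field name -> artifact path."""
    with open(path) as f:
        return json.load(f)

def artifact_path(registry, field):
    """Look up one field's artifact path, failing loudly if it was never trained."""
    if field not in registry:
        raise KeyError(f"no w2v artifact registered for field '{field}'")
    return registry[field]
```

For example, artifact_path(load_registry("embed_artifact/w2v_registry.json"), "creative") would return the path of the creative-id embedding, assuming the registry layout above.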