Tencent-Ads-Algo-Comp-2020

Tencent Advertisement Algorithm Competition 2020

MIT license

Git repo for Tencent Advertisement Algorithm Competition


Quick Start

cd ./Script
. prerequisite.sh
python3 input_generate.py
python3 input_split.py fine
python3 train_w2v.py creative 128
python3 train_w2v.py ad 128
python3 train_w2v.py advertiser 128
python3 train_w2v.py product 128
python3 train_v2_age_lstm_multiInp.py 40 512 100 1e-3

Script Documentation

Model Training V2

  • How to run training script

    Syntax: python3 train_v2_{some script name}.py 10 512 100 1e-3

    Argument:

    1. (Required, INT) target epoch to train
    2. (Required, INT) batch size for training
    3. (Required, INT) maximal length of the input sequence; a smaller length allows training with a larger batch size
    4. (Required, FLOAT) learning rate for the Adam optimizer
    5. (Optional, INT) epoch from which to resume training; if nothing is specified, the model is trained from scratch
    6. (Optional, INT) training file from which to resume; if nothing is specified, the model is trained from scratch
      • Example: 9 2 resumes training from epoch 9, file 2.
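
    The positional arguments above can be handled with plain sys.argv parsing. The sketch below is illustrative only; the function and variable names are assumptions, not taken from the repo's scripts:

    ```python
    def parse_args(argv):
        """Parse: target_epoch batch_size max_len lr [resume_epoch resume_file]."""
        target_epoch = int(argv[1])
        batch_size = int(argv[2])
        max_seq_len = int(argv[3])
        learning_rate = float(argv[4])
        # Optional resume point: epoch and training file, both integers.
        resume_epoch = int(argv[5]) if len(argv) > 5 else None
        resume_file = int(argv[6]) if len(argv) > 6 else None
        return target_epoch, batch_size, max_seq_len, learning_rate, resume_epoch, resume_file
    ```

    For example, `parse_args(["train.py", "40", "512", "100", "1e-3"])` corresponds to the Quick Start invocation and leaves both resume values unset.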
  • Training script inventory

    |--Script
      |--data_loader_v2.py
      |
      |--clf_lstm.py             # Model based on stacked LSTM
      |--clf_gnmt.py             # Model based on GNMT (Google Neural Translation Machine)
      |--clf_tf_enc.py           # Model based on Encoder part of Transformer
      |--clf_esim.py             # Model based on ESIM (Enhanced Sequential Inference Model)
      |--clf_pre_ln_tf.py        # Model based on pre Layer Normalization Transformer
      |
      |--train_v2_age_lstm_multiInp.py
      |--train_v2_age_lstm_v2_multiInp.py
      |--train_v2_age_tf_enc_multiInp.py
      |--train_v2_age_gnmt_multiInp.py
      |--train_v2_age_esim_multiInp.py
      |--train_v2_age_pre_ln_tf_multiInp.py
      |
      |--train_v2_gender_lstm_multiInp.py
    

Legacy - Model Training V1

  • How to run training script

    Syntax: python3 train_{some script name}.py 0 10 256 100 1e-3 split

    Argument:

    1. (Required, INT) 0 means training from scratch; a positive number loads the corresponding epoch and resumes training from there
    2. (Required, INT) number of epochs to train
    3. (Required, INT) batch size for training
    4. (Required, INT) maximal length of the input sequence; a smaller length allows training with a larger batch size
    5. (Required, FLOAT) learning rate for the Adam optimizer
    6. (Optional) if nothing is specified, the model is trained on the unsplit files; if python3 input_split.py fine has been executed and a value is specified, the model is trained on the list of split files
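
    A V1 script could select its inputs based on the optional split argument roughly as follows. This is an illustrative sketch; the split-file naming scheme is an assumption, not taken from the repo:

    ```python
    def input_files(split: bool, n_splits: int = 4):
        """Return a single unsplit pickle, or a list of split chunk pickles."""
        base = "input_artifact/train_creative_id_seq"  # hypothetical base path
        if not split:
            return [f"{base}.pkl"]
        # Assumed naming convention for chunks produced by input_split.py fine.
        return [f"{base}_{i}.pkl" for i in range(n_splits)]
    ```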
  • Training script inventory

    |--Script
      |--data_loader.py
      |
      |--multi_seq_lstm_classifier.py
      |--train_age_multi_seq_lstm_classifier.py
      |--train_gender_multi_seq_lstm_classifier.py
      |
      |--transformer_encoder_classifier.py
      |--train_age_transformer_encoder_classifier_with_creative.py
      |
      |--GNMT_classifier.py
      |--train_age_GNMT_classifier_with_creative.py
      |
      |--multi_seq_GNMT_classifier.py
      |--train_age_multi_seq_GNMT_classifier.py
    

Data Preparation

  • Step 1: run
cd ./Script
. prerequisite.sh

Note that if the instance has no public internet connection, download the train file and the test file and put them under ./Script. You should have the following files and directories after execution.

|--Script
  |--train_artifact
    |--user.csv
    |--click_log.csv
    |--ad.csv
  |--test_artifact
    |--click_log.csv
    |--ad.csv
  |--input_artifact
  |--embed_artifact
  |--model_artifact
  |--output_artifact
  • Step 2: run
python3 input_generate.py
python3 input_split.py

For machines with limited memory, replace the second line with python3 input_split.py fine. You should have the following files after execution.

|--Script
  |--input_artifact
    |--train_idx_shuffle.npy
    |--train_age.npy
    |--train_gender.npy
    |--train_creative_id_seq.pkl
    |--train_ad_id_seq.pkl
    |--train_advertiser_id_seq.pkl
    |--train_product_id_seq.pkl
    |--test_idx_shuffle.npy
    |--test_creative_id_seq.pkl
    |--test_ad_id_seq.pkl
    |--test_advertiser_id_seq.pkl
    |--test_product_id_seq.pkl
  |--embed_artifact
    |--embed_train_creative_id_seq.pkl
    |--embed_train_ad_id_seq.pkl
    |--embed_train_advertiser_id_seq.pkl
    |--embed_train_product_id_seq.pkl
  |--model_artifact
  |--output_artifact
  |--train_artifact
  |--test_artifact
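
The fine mode presumably breaks the large sequence pickles into smaller chunks so that training can stream them on low-memory machines. A minimal sketch of that idea (not the actual input_split.py; the chunk count and suffix scheme are assumptions):

```python
import pickle

def split_into_chunks(seqs, n_chunks):
    """Split a list of sequences into n_chunks roughly equal slices."""
    chunk_size = (len(seqs) + n_chunks - 1) // n_chunks  # ceiling division
    return [seqs[i:i + chunk_size] for i in range(0, len(seqs), chunk_size)]

def split_pickle(path, n_chunks):
    """Load one large pickle and dump each chunk with an assumed _{i} suffix."""
    with open(path, "rb") as f:
        seqs = pickle.load(f)
    stem = path[: -len(".pkl")]
    for i, chunk in enumerate(split_into_chunks(seqs, n_chunks)):
        with open(f"{stem}_{i}.pkl", "wb") as f:
            pickle.dump(chunk, f)
```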
  • Step 3: run
python3 train_w2v.py creative 128
python3 train_w2v.py ad 128
python3 train_w2v.py advertiser 128
python3 train_w2v.py product 128

You should have the following files after execution.

|--Script
  |--embed_artifact
    |--w2v_registry.json
    |--creative_embed_s128_{random token}
    |--ad_embed_s128_{random token}
    |--advertiser_embed_s128_{random token}
    |--product_embed_s128_{random token}
    |--embed_train_creative_id_seq.pkl
    |--embed_train_ad_id_seq.pkl
    |--embed_train_advertiser_id_seq.pkl
    |--embed_train_product_id_seq.pkl
  |--model_artifact
  |--input_artifact
  |--output_artifact
  |--train_artifact
  |--test_artifact

Note that w2v_registry.json stores all the w2v model artifact paths.
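
A downstream script can read w2v_registry.json to locate each embedding artifact. The sketch below round-trips a registry; the key names and paths are hypothetical, and loading the embeddings themselves would use gensim's Word2Vec.load:

```python
import json
import os
import tempfile

# Hypothetical registry contents; the real keys/paths are written by train_w2v.py.
registry = {
    "creative": "embed_artifact/creative_embed_s128_ab12cd",
    "ad": "embed_artifact/ad_embed_s128_ef34gh",
}

# Persist and reload the registry, mimicking how a training script would
# look up the w2v artifact path for each id-sequence type.
path = os.path.join(tempfile.mkdtemp(), "w2v_registry.json")
with open(path, "w") as f:
    json.dump(registry, f)

with open(path) as f:
    w2v_registry = json.load(f)

creative_path = w2v_registry["creative"]
# With gensim installed, the model could then be loaded:
# from gensim.models import Word2Vec
# creative_w2v = Word2Vec.load(creative_path)
```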

Materials