/google-ctr-prediction

POC for Google CTR Prediction

SleepBot

SleepBot is a click prediction bot that uses data from Google Search Console to predict the Net Clicks of a keyword for a given date.

Dataset

The bot is trained using the FDF which has:

  • 1368792 Records
  • 74 Features
  • 19269 Keywords

Methodology

The Dataset is divided into three classes based on the Net Clicks:

  • Clicks Level 0 if Net Clicks is 0
  • Clicks Level 1 if Net Clicks is 1
  • Clicks Level 2 otherwise

This division makes it possible to train a three-level model:

  • Level 1 Classifier which predicts 0 (Clicks Level 0) or 1 (Clicks Level other than 0)
  • Level 2 Classifier which predicts 0 (Clicks Level 1) or 1 (Clicks Level 2)
  • Level 3 Regressor which predicts the Net Clicks
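
The labelling rule and the three levels above can be sketched as follows. This is a minimal illustration, not the project's actual code: the routing inside `cascade_predict` (0 clicks when Level 1 predicts 0, 1 click when Level 2 predicts 0, the regressor's output otherwise) is an assumption about how the levels are combined, and `ThresholdModel` and `DoublingModel` are hypothetical stand-ins for fitted models.

```python
import numpy as np


def click_level(net_clicks):
    """Map Net Clicks to a class: 0 -> Level 0, 1 -> Level 1, anything else -> Level 2."""
    net_clicks = np.asarray(net_clicks)
    return np.where(net_clicks == 0, 0, np.where(net_clicks == 1, 1, 2))


def cascade_predict(level1, level2, level3, X):
    """Combine the three models into one Net Clicks prediction (assumed routing)."""
    X = np.asarray(X)
    pred = np.zeros(len(X), dtype=float)           # default: Clicks Level 0 -> 0 clicks
    idx = np.flatnonzero(level1.predict(X) == 1)   # rows predicted to have clicks
    if idx.size:
        many = level2.predict(X[idx]) == 1         # True -> Clicks Level 2
        pred[idx[~many]] = 1.0                     # Clicks Level 1 -> exactly 1 click
        if many.any():
            pred[idx[many]] = level3.predict(X[idx[many]])
    return pred


class ThresholdModel:
    """Hypothetical stand-in for a fitted classifier: 1 if the first feature exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, X):
        return (np.asarray(X)[:, 0] > self.threshold).astype(int)


class DoublingModel:
    """Hypothetical stand-in for a fitted regressor."""

    def predict(self, X):
        return np.asarray(X)[:, 0] * 2.0


X_demo = np.array([[0.0], [1.0], [3.0]])
demo_pred = cascade_predict(ThresholdModel(0.5), ThresholdModel(2.0), DoublingModel(), X_demo)
```

Routing rows this way means the regressor only ever sees rows that both classifiers flagged as Clicks Level 2.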

Features

Extra Features added to the FDF are:

  • Mean of the Bert Embedding Vector for Keyword
  • Median of the Bert Embedding Vector for Keyword
  • Number of Words in the Keyword
  • Number of Characters in the Keyword
  • Average Number of Words in the Keyword
  • Number of Stop Words in the Keyword
  • Number of Digits in the Keyword
  • Day of the Week
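
Most of the features listed above can be derived from the keyword string and the date alone. The sketch below is illustrative only: `keyword_features` and its output keys are hypothetical names, `STOP_WORDS` is a tiny placeholder set (the project lists NLTK as a dependency, so its English stop-word list is presumably what is used), the embedding is passed in as a plain list rather than computed with BERT, and the character count is assumed to include spaces. "Average Number of Words in the Keyword" is left out because its exact definition is not given.

```python
import statistics
from datetime import date

# Placeholder stop-word set for illustration only (assumption, see lead-in).
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "to", "and", "is", "on"}


def keyword_features(keyword, day, embedding):
    """Build the extra per-keyword features from a keyword, a date, and an embedding vector."""
    words = keyword.split()
    return {
        "embedding_mean": statistics.fmean(embedding),
        "embedding_median": statistics.median(embedding),
        "n_words": len(words),
        "n_chars": len(keyword),  # assumed to include spaces
        "n_stop_words": sum(w.lower() in STOP_WORDS for w in words),
        "n_digits": sum(ch.isdigit() for ch in keyword),
        "day_of_week": day.weekday(),  # Monday = 0 ... Sunday = 6
    }


features = keyword_features("best phones of 2019", date(2019, 10, 10), [0.1, -0.2, 0.4])
```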

Training

Outlier Removal

The data is grouped by date, and outliers are removed from each group using standard z-score normalization. Only values within ±2 standard deviations of the group mean are kept.

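stds = 2  # assumed threshold, matching the +-2 std rule above
# z-score every column within its date group
z = df.groupby('date').transform(lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
# keep only rows where no column is an outlier
df_out = df[~outliers.any(axis=1)]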

Level 1 Model

An XGBoost classifier is trained on the data before 2019-10-10 to distinguish Clicks Level 0 from all other levels. The rest of the data is used for testing. The model has 20 estimators and a maximum depth of 6. These parameters were selected using cross-validation with a time series split.

Confusion Matrix

Level 1 Confusion Matrix

Feature Importance

Level 1 Feature Importances

Level 2 Model

An XGBoost classifier is trained on the data before 2019-10-10 to distinguish Clicks Level 1 from Clicks Level 2. The rest of the data is used for testing. The model has 25 estimators and a maximum depth of 5. These parameters were selected using cross-validation with a time series split.

Confusion Matrix

Level 2 Confusion Matrix

Feature Importance

Level 2 Feature Importances

Level 3 Model

An XGBoost regressor is trained on the data before 2019-10-10 to predict the Net Clicks. The rest of the data is used for testing. The model has 40 estimators and a maximum depth of 8.

Results

The data after 2019-10-10 is used for testing.

Combined Confusion Matrix of Level 1 and Level 2 Model

Combined Confusion Matrix

Performance

Metric                     Value
Mean Absolute Error        0.028
Root Mean Squared Error    0.272
Variance Score             0.628
False Negatives            5%
False Positives            1.8%
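
For reference, metrics of this kind can be computed as in the sketch below. The definitions are assumptions, since the source does not spell them out: "Variance Score" is taken to be the explained variance, 1 - Var(y - ŷ) / Var(y), and false negatives and false positives are taken as shares of all rows where clicks occurred but none were predicted, and vice versa. The function names are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """MAE, RMSE and explained variance of predicted Net Clicks."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    return {
        "mae": float(np.mean(np.abs(residual))),
        "rmse": float(np.sqrt(np.mean(residual ** 2))),
        "explained_variance": float(1.0 - np.var(residual) / np.var(y_true)),
    }


def click_error_rates(y_true, y_pred):
    """Share of rows with missed clicks (FN) and with falsely predicted clicks (FP)."""
    has_true = np.asarray(y_true) > 0
    has_pred = np.asarray(y_pred) > 0
    return {
        "false_negative_rate": float(np.mean(has_true & ~has_pred)),
        "false_positive_rate": float(np.mean(~has_true & has_pred)),
    }


# Toy values for illustration only, not project data.
y_true_demo = [0, 1, 2, 0]
y_pred_demo = [0, 0, 2, 1]
reg = regression_metrics(y_true_demo, y_pred_demo)
err = click_error_rates(y_true_demo, y_pred_demo)
```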

Model Comparison

Metrics Comparison

Metrics New

Dependencies

  • Pandas
  • NumPy
  • Transformers
  • XGBoost
  • NLTK
  • PyTorch
  • Scikit-Learn