SleepBot is a click prediction bot that uses data from Google Search Console to predict the Net Clicks of a keyword for a given date.
The bot is trained on the FDF, which has:
- 1,368,792 Records
- 74 Features
- 19,269 Keywords
The dataset is divided into three classes based on the Net Clicks:
- Clicks Level 0 if Net Clicks is 0
- Clicks Level 1 if Net Clicks is 1
- Clicks Level 2 otherwise
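The three-way split above can be written as a small labelling function. This is a minimal sketch; the name `click_level` is hypothetical and not taken from the project code.

```python
def click_level(net_clicks: int) -> int:
    """Map a Net Clicks count to its Clicks Level (0, 1, or 2)."""
    if net_clicks == 0:
        return 0  # Clicks Level 0: no clicks
    if net_clicks == 1:
        return 1  # Clicks Level 1: exactly one click
    return 2      # Clicks Level 2: everything else


levels = [click_level(n) for n in [0, 1, 2, 7]]
print(levels)  # [0, 1, 2, 2]
```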
This division is done to facilitate the training of a three-level model:
- Level 1 Classifier which predicts 0 (Clicks Level 0) or 1 (Clicks Level other than 0)
- Level 2 Classifier which predicts 0 (Clicks Level 1) or 1 (Clicks Level 2)
- Level 3 Regressor which predicts the Net Clicks
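The text does not say how the three models are combined at prediction time. One plausible reading, shown here purely as an assumption, is a cascade: the Level 1 classifier filters out zero-click cases, the Level 2 classifier filters out one-click cases, and the regressor handles the rest. The function name `predict_net_clicks` and the toy stand-in models are hypothetical.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def predict_net_clicks(
    features: Features,
    level1: Callable[[Features], int],
    level2: Callable[[Features], int],
    regressor: Callable[[Features], float],
) -> float:
    """Chain the three models; each callable is a stand-in for a trained model."""
    if level1(features) == 0:
        return 0.0  # Level 1 predicts Clicks Level 0
    if level2(features) == 0:
        return 1.0  # Level 2 predicts Clicks Level 1
    return regressor(features)  # Clicks Level 2: defer to the regressor


# Toy stand-ins (assumptions) so the control flow runs without trained models.
def toy_level1(x: Features) -> int:
    return int(x[0] > 0)


def toy_level2(x: Features) -> int:
    return int(x[0] > 1)


def toy_regressor(x: Features) -> float:
    return 2.0 * x[0]


predictions = [
    predict_net_clicks([v], toy_level1, toy_level2, toy_regressor)
    for v in (0, 1, 3)
]
print(predictions)  # [0.0, 1.0, 6.0]
```

With real models, each callable would wrap the corresponding model's prediction for a single feature row.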
Extra Features added to the FDF are:
- Mean of the BERT Embedding Vector for Keyword
- Median of the BERT Embedding Vector for Keyword
- Number of Words in the Keyword
- Number of Characters in the Keyword
- Average Number of Words in the Keyword
- Number of Stop Words in the Keyword
- Number of Digits in the Keyword
- Day of the Week
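The simpler text and calendar features in this list can be sketched with the standard library alone. The function name `keyword_features`, the dictionary keys, and the tiny stop-word set are assumptions for illustration (the project most likely uses a full stop-word list such as NLTK's). The BERT-based features and the average-word feature are left out because the text does not define how they are computed.

```python
from datetime import date

# Tiny illustrative stop-word set (assumption); a real list would be much larger.
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "for", "how"}


def keyword_features(keyword: str, day: date) -> dict:
    """Compute simple text and calendar features for one keyword/date pair."""
    words = keyword.split()
    return {
        "num_words": len(words),
        "num_chars": len(keyword),  # spaces are counted as characters here
        "num_stop_words": sum(w.lower() in STOP_WORDS for w in words),
        "num_digits": sum(ch.isdigit() for ch in keyword),
        "day_of_week": day.weekday(),  # Monday = 0 ... Sunday = 6
    }


features = keyword_features("how to sleep in 5 minutes", date(2019, 10, 10))
print(features)
```

Whether spaces count towards the character total is a design choice the text does not settle; the sketch counts them.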
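The data is grouped by date, and outliers are removed from each group using z-score standardization: only rows whose values all lie within ±2 standard deviations of the group mean are kept.

    stds = 2  # keep values within +-2 standard deviations of the per-date mean
    # z-score of every numeric feature, computed separately for each date
    z = df.groupby('date').transform(lambda group: (group - group.mean()).div(group.std()))
    outliers = z.abs() > stds
    # drop every row that has at least one outlying feature
    df_out = df[~outliers.any(axis=1)]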
An XGBoost classifier (Level 1) is trained on the data before 2019-10-10 to predict whether a record is Clicks Level 0 or not. The rest of the data is used for testing. The model has 20 estimators and a maximum depth of 6. These parameters were selected using cross-validation with a time series split.
An XGBoost classifier (Level 2) is trained on the data before 2019-10-10 to predict whether a record is Clicks Level 1 or Clicks Level 2. The rest of the data is used for testing. The model has 25 estimators and a maximum depth of 5. These parameters were selected using cross-validation with a time series split.
An XGBoost regressor (Level 3) is trained on the data before 2019-10-10 to predict the Net Clicks. The rest of the data is used for testing. The model has 40 estimators and a maximum depth of 8.
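All three models share the same date-based split, and the classifier parameters were chosen with a time series split. The sketch below illustrates both ideas without any dependencies. The helper names `temporal_split` and `expanding_window_folds` are hypothetical, the hand-rolled folds only mirror the expanding-window idea behind a time series split, and placing the cutoff day itself in the test set is an assumption.

```python
from datetime import date

CUTOFF = date(2019, 10, 10)  # split date stated in the text


def temporal_split(rows):
    """Split rows (dicts with a 'date' key) into train (< CUTOFF) and test (>= CUTOFF)."""
    train = [r for r in rows if r["date"] < CUTOFF]
    test = [r for r in rows if r["date"] >= CUTOFF]
    return train, test


def expanding_window_folds(n_samples, n_splits):
    """Yield (train_indices, validation_indices) over time-ordered samples.

    Each fold trains on everything before its validation block, so no
    future data leaks into training.
    """
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - k + 1) * fold_size
        yield list(range(train_end)), list(range(train_end, train_end + fold_size))


rows = [{"date": date(2019, 10, d)} for d in (8, 9, 10, 11)]
train, test = temporal_split(rows)
folds = list(expanding_window_folds(n_samples=6, n_splits=2))
print(len(train), len(test))  # 2 2
print(folds)  # [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]
```

Unlike shuffled k-fold cross-validation, every validation block here lies strictly after its training data, which matches how the model is used to predict future dates.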
Results on the test data (after 2019-10-10):
| Metric | Value |
|---|---|
| Mean Absolute Error | 0.028 |
| Root Mean Squared Error | 0.272 |
| Variance Score | 0.628 |
| False Negatives | 5% |
| False Positives | 1.8% |
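For reference, the metrics in the table can be computed as sketched below. Two points are assumptions, since the text does not define them: "Variance Score" is read as the explained variance score, and the false negative and false positive figures are taken as shares of all test samples. The function names and toy inputs are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Mean absolute error, root mean squared error, and explained variance."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Explained variance: 1 - Var(errors) / Var(y_true), population variance.
    mean_e = sum(errors) / n
    mean_t = sum(y_true) / n
    var_e = sum((e - mean_e) ** 2 for e in errors) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    return {"mae": mae, "rmse": rmse, "explained_variance": 1 - var_e / var_t}


def error_rates(labels_true, labels_pred):
    """False negatives and false positives as a share of all samples (assumption)."""
    n = len(labels_true)
    pairs = list(zip(labels_true, labels_pred))
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return {"false_negatives": fn / n, "false_positives": fp / n}


reg = regression_metrics([0, 1, 2, 3], [0, 1, 2, 5])
rates = error_rates([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(reg)
print(rates)
```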

Libraries used:
- pandas
- NumPy
- Transformers
- XGBoost
- NLTK
- PyTorch
- scikit-learn