The fourth mini project from Rakamin Academy
To run this code, don't forget to install the requirements. Thanks to Rakamin for giving me this task, especially Mr. Abdullah Ghifari as my mentor and Mr. Travis Tang for the article about LazyPredict. Readers can find the article at this link: https://pub.towardsai.net/lazypredict-run-all-sklearn-algorithms-with-a-line-of-code-29d73d82499c
Readers who want to see my presentation can open the PDF file; otherwise, just read the summary below.
From the graph, we know that customers tend to click our ad if they:
- Spend <= 60 minutes per day on our site.
- Are >= 40 years old.
- Have an income <= 3.2 hundred million.
- Have <= 170 minutes of daily internet usage.
From the graph, we know that more than 50% of women click on our ads, and that customers are more interested in our ads about house, finance, fashion and automotive: in each of these categories, more than 50% of customers clicked the ad.
From the graph, we know:
- Customers who are older and have lower daily time spent on site and daily internet usage tend to click our ads.
- Customers who are younger and have higher daily time spent on site and daily internet usage tend not to click our ads.
From the graph, we know that Daily Internet Usage and Daily Time Spent on Site have the highest correlation of any pair of columns; the second highest is Age with Daily Internet Usage, and the third is Area Income with Daily Internet Usage.
Using the Chi-Square test to check the association between categorical columns, we find that Clicked on Ad is strongly associated with city, and Male with category.
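As a rough illustration of the Chi-Square check described above, here is a minimal sketch using `pandas.crosstab` and `scipy.stats.chi2_contingency`. The data is a toy table invented for this example, and the "Yes"/"No" values are assumptions; only the column names come from the text.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data, invented for illustration; not the project's dataset.
df = pd.DataFrame({
    "category": ["House", "House", "Food", "Food", "House", "Food", "House", "Food"],
    "Clicked on Ad": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
})

# Contingency table of observed counts for the two categorical columns.
table = pd.crosstab(df["category"], df["Clicked on Ad"])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom and the expected counts under the independence hypothesis.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

A small p-value suggests the two columns are associated; with a toy table this small, the result itself is not meaningful.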
We have 4 columns with null data: 3 of them are numeric and the other is categorical.
To fill the null data, I use the median for numeric columns and the mode for categorical columns.
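The imputation described above could look roughly like this in pandas. This is a minimal sketch on invented toy data; the choice of columns and their values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with missing values, invented for illustration.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 55.0],
    "Area Income": [100.0, 300.0, np.nan, 200.0],
    "Male": ["Yes", "No", None, "No"],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Median for numeric columns: less sensitive to outliers than the mean.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Mode (most frequent value) for categorical columns.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```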
We get 5 new columns: year, month, day, weekday and is_weekend. Don't forget to remove the timestamp column, because it cannot be used to train an ML model.
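A minimal sketch of how these five columns could be derived with pandas. The column name `Timestamp`, the sample values, and the Saturday/Sunday definition of a weekend are assumptions for this example.

```python
import pandas as pd

# Toy data, invented for illustration; "Timestamp" is an assumed column name.
df = pd.DataFrame({"Timestamp": ["2016-03-27 00:53:00", "2016-04-04 01:39:00"]})
ts = pd.to_datetime(df["Timestamp"])

df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["day"] = ts.dt.day
df["weekday"] = ts.dt.weekday                         # Monday=0 ... Sunday=6
df["is_weekend"] = (ts.dt.weekday >= 5).astype(int)   # assumes Sat/Sun weekend

# Drop the raw timestamp once its information has been extracted.
df = df.drop(columns=["Timestamp"])
print(df)
```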
Before we do feature transformation on the categorical columns, we divide them into 2 groups depending on their unique values.
Used for categorical columns that have 2 unique values or ordinal data.
Used for categorical columns that have more than 2 unique values or nominal data.
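The text here does not name the two encodings, so the following sketch assumes an integer (label-style) mapping for the two-value columns and one-hot encoding for the nominal ones. The data and the "Yes"/"No" values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Male": ["Yes", "No", "Yes"],
    "category": ["House", "Finance", "Fashion"],
})

# 2 unique values: map each value to an integer (assumed label-style encoding).
df["Male"] = df["Male"].map({"No": 0, "Yes": 1})

# More than 2 unique values, nominal: one indicator column per value
# (assumed one-hot encoding).
df = pd.get_dummies(df, columns=["category"], dtype=int)
print(df)
```

One-hot encoding avoids implying an order between nominal values, at the cost of adding one column per unique value.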
We scale the numeric columns, because some ML models achieve better accuracy when every numeric column is on the same scale. Before doing this, we need to make a list that contains the numeric columns.
We split the data into 2 parts: features and target.
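A minimal sketch of scaling a list of numeric columns. The text does not say which scaler was used; `StandardScaler` from scikit-learn is an assumption here, and the data is invented.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Daily Internet Usage": [100.0, 200.0, 300.0],
    "Male": [1, 0, 1],
})

# List of numeric columns to scale; encoded categorical columns are left alone.
numeric_cols = ["Age", "Daily Internet Usage"]

# StandardScaler (an assumed choice) rescales each column to mean 0, std 1.
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
print(df)
```

In a real pipeline it is usually safer to fit the scaler on the training split only and reuse it on the test split.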
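As a sketch, separating features from the target could look like the following. The toy values, the 0/1 target encoding, and the additional train/test split (with its `test_size` and `random_state`) are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Age": [25, 42, 31, 55, 38, 47, 29, 60],
    "Daily Internet Usage": [220, 150, 240, 120, 200, 140, 230, 110],
    "Clicked on Ad": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Features are every column except the target.
X = df.drop(columns=["Clicked on Ad"])
y = df["Clicked on Ad"]

# Assumed hold-out split; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```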
For this modelling, we will run two experiments:
1. The first model is trained on data whose numeric columns have not been scaled.
2. The second model is trained on data that has passed through all data preprocessing steps.
In this case, I will try a library called LazyPredict to build the models; the other models I will try are LightGBM, RandomForest and XGBoost.
The top model is XGBClassifier from XGBoost. It has high evaluation scores (about 96% accuracy), but its time to predict the testing data is 0.14 seconds (see the Time Taken column), which is 7x longer than logistic regression at 94% accuracy.
So if you need a model with higher accuracy, you can use XGBClassifier, but if you need a model that predicts faster, you can use logistic regression.
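To make the accuracy versus speed trade-off concrete, here is a minimal sketch that measures both for two scikit-learn models. It uses synthetic data and substitutes RandomForestClassifier for XGBClassifier to keep dependencies small, so the numbers it prints are not the project's results.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    start = time.perf_counter()          # time only the prediction step
    y_pred = model.predict(X_test)
    elapsed = time.perf_counter() - start
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "predict_seconds": elapsed,
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, "
          f"predict time={scores['predict_seconds']:.4f}s")
```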
Compared with the results of experiment 1, there are no significant differences; only the time to predict the test data is faster. Looking at XGBClassifier, the time taken is 0.14 seconds in the first experiment and 0.09 seconds in the second.
- The model is trained on preprocessed data to get better results.
- XGBClassifier from XGBoost gets better evaluation scores than the other models. But if you need a model that predicts faster than XGBoost, I recommend logistic regression.
- Because I need a model with better evaluation scores, I choose XGBClassifier.
The top 4 features that affect whether customers click on our ads are Area Income, Daily Internet Usage, Daily Time Spent on Site and Age.
Based on the analysis and the feature importance from the model, it can be concluded that:
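A ranking like the one above can be read from a tree-based model's `feature_importances_` attribute. The sketch below uses RandomForestClassifier on synthetic data as a stand-in for the project's XGBClassifier; the feature names are attached to random columns purely as labels, so the printed order is not the project's result.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names borrowed from the text; the underlying values are synthetic.
feature_names = [
    "Area Income", "Daily Internet Usage", "Daily Time Spent on Site", "Age", "Male",
]
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)
X = pd.DataFrame(X, columns=feature_names)

# Stand-in model; any estimator exposing feature_importances_ works the same way.
model = RandomForestClassifier(random_state=0).fit(X, y)

importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
top_4 = importances.head(4)
print(top_4)
```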
1. We need to show our ads more often to customers who meet the following requirements: they have an income of at most 3.2 hundred million, are >= 40 years old, spend <= 60 minutes per day on our site and have <= 170 minutes of daily internet usage.
2. For customers who don’t meet the criteria in number 1, we need to show our ads less often, because few of them click our ads. That way we can make the most of our advertising budget.
We have a balanced amount of data between targets (50% click our ads and 50% do not). Let’s calculate the profit if we don’t use our model or the business recommendation.

Assumption: we show our ads on Google Search Ads, which has an average CPM of $38.40 (in 2021, via topdraw.com), and we earn $0.1 each time a customer clicks our ad. So, per 1000 views:

- Cost = $38.40
- Revenue = (1000 * 50%) * $0.1 = $50
- Profit = Revenue - Cost = $50 - $38.40 = $11.60

** Read about CPM at this link: https://www.investopedia.com/terms/c/cpm.asp
** topdraw full link: https://www.topdraw.com/insights/is-online-advertising-expensive/
Now suppose we use an ML model with 96% accuracy to decide which customers will see our ads. Using the same assumptions as before, the profit per 1000 views is:

- Cost = $38.40
- Revenue = (1000 * 96%) * $0.1 = $96
- Profit = Revenue - Cost = $96 - $38.40 = $57.60

This is nearly 5 times bigger than if we don’t use the ML model.
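The two profit calculations above can be reproduced with a small helper. This is a sketch under the same assumptions as the text (CPM of $38.40, $0.1 revenue per click, and a click rate equal to 50% without the model and 96% with it); the function name is hypothetical.

```python
def profit_per_1000_views(click_rate, revenue_per_click=0.10, cpm=38.40):
    """Profit for 1000 ad views: click revenue minus the cost per mille."""
    revenue = 1000 * click_rate * revenue_per_click
    return revenue - cpm


baseline = profit_per_1000_views(0.50)    # no model: 50% of viewers click
with_model = profit_per_1000_views(0.96)  # with model: assumes 96% click

print(f"Without model: ${baseline:.2f}")
print(f"With model:    ${with_model:.2f}")
print(f"Ratio:         {with_model / baseline:.2f}x")
```

Changing `cpm` or `revenue_per_click` shows how sensitive the conclusion is to these assumed prices.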