bachmannpatrick/CLVTools

Evaluating model performance - missing?

SSMK-wq opened this issue · 3 comments

Thanks for this wonderful package and awesome tutorial.

My question is on assessing model performance. Based on the screenshot from the walkthrough page, I have two questions:

(screenshot from the walkthrough page: predicted vs. actual customer spending)

a) There doesn't seem to be a train/test split for building and assessing the models. Is training on the full dataset recommended, or the right thing to do? If yes, can you shed some insight on how training on the full dataset is useful for models like these, and how they differ from traditional ML models such as random forests?

b) I also see that the model overpredicts the spending of a customer whose actual value is ZERO, and this behavior is the same for all other customers with zero actual spending. Does the model use a population mean or something similar for the prediction? So, how do we assess the model performance?

Should we compute traditional "R2", "RMSE" etc?

Is there any inbuilt approach within the CLVTools package to assess model performance?

Hi!

a) There is an estimation period and a holdout period (if specified). The estimation split is defined in clvdata() using the estimation.split argument; if estimation.split = NULL, there is no holdout period.
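A minimal sketch of this, using the apparelTrans toy data that ships with CLVTools (the split value of 40 weeks is just an illustrative choice, not a recommendation):

```r
library(CLVTools)
data("apparelTrans")

# Build the clv.data object with an explicit estimation/holdout split:
# the first 40 weeks are the estimation period, the rest is the holdout.
clv.apparel <- clvdata(apparelTrans,
                       date.format      = "ymd",
                       time.unit        = "week",
                       estimation.split = 40,
                       name.id          = "Id",
                       name.date        = "Date",
                       name.price       = "Price")

# summary() reports the estimation and holdout period boundaries
summary(clv.apparel)
```

estimation.split also accepts a calendar date instead of a number of time units, which can be convenient when the split should align with, e.g., a fiscal year.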

b) A customer having 0 spending during the holdout period does not mean that this customer has not purchased during the estimation period; there was just no transaction (or only transactions with Price = 0) during the holdout period. The population mean is only used if a customer is not observed at all (no repeat transactions during the estimation period).

We do not include an approach in CLVTools to assess model performance. I would suggest using the accuracy() function from the forecast package.
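A sketch of how that could look, comparing predicted vs. actual holdout transactions. The column names actual.x (actual holdout transactions) and CET (conditional expected transactions) are what predict() returns in recent CLVTools versions; check names(results) on your installation, as column names have changed between releases:

```r
library(CLVTools)
library(forecast)

data("apparelTrans")
clv.apparel <- clvdata(apparelTrans, date.format = "ymd",
                       time.unit = "week", estimation.split = 40)

est.pnbd <- pnbd(clv.apparel)   # fit Pareto/NBD on the estimation period
results  <- predict(est.pnbd)   # per-customer predictions over the holdout

# forecast::accuracy() accepts plain numeric vectors and reports
# ME, RMSE, MAE, MPE, MAPE for predicted vs. actual values
accuracy(object = results$CET, x = results$actual.x)
```

The same pattern works for spending if you compare the predicted and actual spending columns of the predict() output instead.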

Best,
Patrick

@bachmannpatrick - quick follow up questions

a) I tried using the estimation split, but I guess CLVTools puts a restriction on selecting a cohort of users (starting from the same point, e.g. the same month or quarter). Since our dataset spans 5 years and most customers do not start at the same time, my code fails because of insufficient data points in the estimation period. Is there any way to get around this? If I set estimation.split = NULL, there is no test set. So, in probabilistic modeling, is it okay not to treat these like regular ML models with a train/test split?

b) In the above screenshot, for Id = 1 and Id = 10, let's assume they were part of the estimation set with 0 repeat purchase records. Will the value predicted for them then be the population mean?

Most approaches for modeling CLV (no matter if probabilistic or not) start from the assumption that you split your data into cohorts (and thereby look at customers' tenure as the relevant time reference rather than calendar time). For more information, see here: https://github.com/bachmannpatrick/CLVTools/issues/172#issuecomment-862151684. This means you will estimate a separate model for each cohort.

For every cohort dataset, you can choose an individual train/test period. Depending on your data, we often recommend using at least 1 year for training.
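One way to sketch this per-cohort workflow in base R: assign each customer to a cohort by the quarter of their first transaction, then fit one model per cohort with its own split. (Column names Id/Date/Price follow the apparelTrans convention; with real multi-cohort data, very small cohorts may not have enough transactions for the model to converge.)

```r
library(CLVTools)
data("apparelTrans")

# Cohort label = year and quarter of each customer's first transaction
first.purchase <- aggregate(Date ~ Id, data = apparelTrans, FUN = min)
first.purchase$cohort <- paste0(format(first.purchase$Date, "%Y"), "-",
                                quarters(first.purchase$Date))
trans <- merge(apparelTrans, first.purchase[, c("Id", "cohort")], by = "Id")

# Fit a separate Pareto/NBD model per cohort, each with its own
# estimation/holdout split (here: 52 weeks of training as an example)
models <- lapply(split(trans, trans$cohort), function(cohort.trans) {
  clv.cohort <- clvdata(cohort.trans, date.format = "ymd",
                        time.unit = "week", estimation.split = 52)
  pnbd(clv.cohort)
})
```

Note that apparelTrans itself contains only a single cohort, so on this toy data the split produces just one group; the pattern is meant for a multi-year dataset like the one you describe.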

The toy dataset that comes with CLVTools contains the purchase records of a single cohort from an apparel retailer (https://rdrr.io/cran/CLVTools/man/apparelTrans.html).

As far as I understand, following the standard workflow for modeling CLV should solve the issues that you describe above.