Question: how was the model validated?
gshotwell opened this issue · 1 comments
My apologies if I'm not reading your modelling code correctly, but it seems like you're training a high-dimensional GBM on a biased dataset and then using that model to generate estimates for countries that are very different from the countries included in the training data. For example, your data doesn't include many low-income or low-life-expectancy countries, despite the fact that both of those factors are probably related to excess mortality. I also think there are only four, non-representative, African countries.
I see that there's some cross-validation happening, but a GBM model that big can easily overfit this data even with cross-validation, so how are you validating that your model's out-of-sample predictions are correct?
Hi @gshotwell,
Thanks for the input, and the neat chart. I am not sure what you mean by biased here, but I'll assume you mean it in the sense of us having more data from richer countries (or countries with higher life expectancies). That is the case: we certainly wish we had data from a wider set of countries, and have spent much effort trying to acquire it. For instance, you will see that we have data from several subnational units in India, as well as Indonesia. We point out, both on the page itself and in our methodology, that the training data we have access to leads our models to be especially uncertain in poorer countries.
However, while this greatly reduces our precision, it does not directly introduce bias (here used in the sense of errors systematically off in one direction or the other) in our estimates as far as I can tell. (As a side note, I would recommend using PPP-adjusted incomes for the above chart, as they would be the relevant ones).
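To illustrate the distinction between bias and precision drawn above, here is a minimal Python sketch. The numbers are invented for illustration and are not taken from the model or its data, and `bias_and_spread` is a hypothetical helper name. The first set of predictions is imprecise but unbiased (its errors cancel out), while the second is tight but systematically too high.

```python
from statistics import mean, pstdev


def bias_and_spread(predicted, actual):
    """Return (mean signed error, standard deviation of the errors)."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return mean(errors), pstdev(errors)


# Invented values, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
noisy_unbiased = [4.0, 28.0, 22.0, 46.0]  # errors: -6, +8, -8, +6
tight_biased = [13.0, 23.0, 33.0, 43.0]   # errors: +3 every time

noisy_bias, noisy_spread = bias_and_spread(noisy_unbiased, actual)
tight_bias, tight_spread = bias_and_spread(tight_biased, actual)

print(f"noisy: bias={noisy_bias:.2f}, spread={noisy_spread:.2f}")
print(f"tight: bias={tight_bias:.2f}, spread={tight_spread:.2f}")
```

Sparse data from a given kind of country mainly widens the spread; whether it also shifts the mean error is a separate question that can only be checked where observed totals exist.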
A few notes that might clarify: With regard to cross-validation, that is something we do, and it validates that the out-of-sample predictions of the approach are largely correct (we detail this in our methodology). (Naturally, we cannot validate that our estimates or model are correct in countries for which no total mortality data exists; that would be impossible.)
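One way to probe the concern about countries unlike those in the training data is to hold out whole countries rather than random rows. The sketch below is an assumption-laden illustration, not the procedure used in this repository: a straight-line fit stands in for the gradient-boosted model, and the country labels, values, and function names (`fit_line`, `leave_one_group_out`) are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_group_out(rows):
    """rows: list of (group, x, y) tuples.

    For each group, fit on all other groups and return the mean signed
    prediction error on the held-out group.
    """
    results = {}
    for held_out in sorted({g for g, _, _ in rows}):
        train = [(x, y) for g, x, y in rows if g != held_out]
        test = [(x, y) for g, x, y in rows if g == held_out]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        errors = [slope * x + intercept - y for x, y in test]
        results[held_out] = sum(errors) / len(errors)
    return results


# Hypothetical data: (country, predictor, outcome). Country "C" departs
# from the pattern shared by "A" and "B".
rows = [
    ("A", 1, 3), ("A", 2, 5),
    ("B", 3, 7), ("B", 4, 9),
    ("C", 5, 12), ("C", 6, 14),
]

errors = leave_one_group_out(rows)
for country, err in errors.items():
    print(f"held out {country}: mean signed error = {err:+.2f}")
```

A consistently signed error for held-out groups of one type (for example, poorer countries) would point to bias rather than mere imprecision. Such a check can still only be run on countries with observed mortality data.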
I am not sure what you are referring to when you say that a GBM model can easily overfit, or how its size enters into this. We use an information-criterion approach to avoid overfitting (there is a link in the readme to the relevant paper detailing this), not one based on cross-validation of hyperparameters. Gradient boosting algorithms, be it the algorithm we use or competing ones such as xgboost or catboost, are also preferred by many precisely because they do well at out-of-sample prediction. Income and life expectancy are also included in our models explicitly as predictors, meaning that if such patterns exist, the model will adjust predictions accordingly (as far as it can discover such relationships within our training data).
If you have any ideas or sources of data from other settings, please do let us know so we can incorporate them and improve our models. We also welcome any suggestions on how to improve the model itself, or any competing algorithm that does better in cross-validation. Thanks!