A consideration of XGBoost as a replacement technique for imputing missing values in official statistics.
This work was originally undertaken as part of a Data Science Academy project. The Academy is a scheme internal to the UK's Office for National Statistics that allows its staff to undertake short (two-week) projects applying machine learning techniques to domains the mentee already knows, supervised by members of @datasciencecampus.
This project was created for @Vinayak-NZ.
Techniques compared include:
- XGBoost
- CANCEIS
- RBEIS
- Mixed methods
Further work is ongoing outside this repository on other approaches, including multiple imputation and more sophisticated techniques (e.g. autoencoders).
There is also a separate workstream considering the use of genetic algorithms for this type of work. This is an Academy project for another member of the imputation methodology team.
Future plans include a methodological assessment of how suitable these techniques are in practice, and a more abstract consideration of which mechanisms would be most suitable for imputation.
The project write-up is available on the gh-pages branch, and the presentation given after completion of the work is on the presentation branch.