This is a description of our solution for this hackathon, which achieved 1st place on the private leaderboard with a score of 3.5123. The challenge was to identify the wards in South Africa most vulnerable to the COVID-19 pandemic using old data.
We quickly noticed that the dataset was small, and since it was old census data, we expected it to be messy. We cleaned the data to eliminate points that could seriously damage the model's performance. We also applied clustering early on, and then tried a considerable number of feature interactions, since all but a couple of the features were percentages. We also looked at the target's behavior. Because the validation score was not reliable at all, we tested the interactions one by one by probing the leaderboard to see how effective each was. In the end, we derived over a hundred features from the original ones but settled on a few that we handpicked. Finally, we applied PCA for some dimensionality reduction.
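A minimal sketch of what generating pairwise interactions between percentage features might look like is shown here. The column names, the toy values, and the choice of products and ratios are assumptions for illustration, not the features we actually used.

```python
from itertools import combinations


def add_pairwise_interactions(rows, columns, eps=1e-6):
    """Return new rows with a product and a ratio feature for each column pair.

    rows: list of dicts mapping column name -> percentage value (0-100).
    columns: names of the percentage columns to combine.
    eps: small constant that avoids division by zero in the ratios.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # keep the original features untouched
        for a, b in combinations(columns, 2):
            # Product of two percentages, rescaled back to a 0-100 range.
            new_row[f"{a}_x_{b}"] = row[a] * row[b] / 100.0
            # Ratio of the two percentages.
            new_row[f"{a}_div_{b}"] = row[a] / (row[b] + eps)
        out.append(new_row)
    return out


# Hypothetical ward-level percentage columns (not the real dataset schema).
wards = [
    {"pct_no_water": 20.0, "pct_informal_housing": 50.0, "pct_elderly": 10.0},
    {"pct_no_water": 5.0, "pct_informal_housing": 0.0, "pct_elderly": 8.0},
]
cols = ["pct_no_water", "pct_informal_housing", "pct_elderly"]
features = add_pairwise_interactions(wards, cols)
print(sorted(features[0]))  # original columns plus two interactions per pair
```

Each candidate produced this way can then be kept or dropped depending on whether it improves the score, which is how a large pool of generated features gets narrowed down to a handful.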
Our model was a single XGBoost model (we tried LightGBM and CatBoost early on, but XGBoost seemed to outperform both of them in this particular challenge) that we manually tuned and tested over and over again. We went for a single strong model rather than an ensemble of weaker models, which paid off.
Never give up, even at the very end of a challenge. We basically secured 1st place in the last hour before the competition ended; otherwise we would have placed 2nd.
Do not hesitate to try ideas that seem crazy or useless in the context of such a challenge, given the number of submissions we were allowed (some of our stupidest ideas worked and got us a better score).
A tip for hackathons: always set up a quick baseline with raw features and a raw model, record its score, and try to beat that score in every run you do.
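As a minimal sketch of that workflow, the snippet below scores a trivial mean-prediction baseline and checks a later run against it. The RMSE metric, the toy targets, and the helper names are assumptions for illustration; substitute the competition's actual metric and whatever raw model you start from.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error (assumed metric; lower is better)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


class BaselineTracker:
    """Remember the best score so far and report whether a run beats it."""

    def __init__(self, baseline_score):
        self.best = baseline_score

    def submit(self, name, score):
        improved = score < self.best
        if improved:
            self.best = score
        status = "improved" if improved else "no gain"
        print(f"{name}: {score:.4f} ({status})")
        return improved


# Toy data: training targets and a held-out set (hypothetical values).
y_train = [2.0, 4.0, 6.0, 8.0]
y_valid = [3.0, 5.0, 7.0]

# Raw baseline: always predict the training mean.
mean_pred = sum(y_train) / len(y_train)
baseline = rmse(y_valid, [mean_pred] * len(y_valid))

tracker = BaselineTracker(baseline)
# Predictions from a hypothetical improved run.
beat = tracker.submit("run_1", rmse(y_valid, [3.5, 5.0, 6.5]))
```

The same tracker works whether the score comes from local validation or from a leaderboard submission; the point is simply that every run is compared against a fixed reference.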