Team GAN - Gabriel Dehame, Andrea Miele, Nicolas Moyne
helpers.py is the provided file to load the dataset and create submissions.
implementations.py implements the ML methods of Step 2.
score.py implements functions to calculate the quality of a model (f1_score, accuracy).
preprocessing.py implements our preprocessing, which removes useless features and uses an ANOVA (Analysis of Variance) to find the most impactful features, which are the ones to keep in priority. The ANOVA is implemented in anova_selection.py. The preprocessing also performs oversampling and undersampling to cope with the unbalanced dataset; these are implemented in OverUnderSampling.py.
utils.py implements utility methods used to train models, such as build_poly (computing polynomial extensions), standardize (standardizing data) and predict (computing the prediction for a given model and given datapoints). It also contains a logistic regression with Newton's method, cross-validation, local search and KNN, but we ended up not using them because they were too slow and/or led to strange results, so they might be buggy.
finetuning.py implements the tuning of the hyperparameters for the models we compared.
run.py reproduces the training of the best-performing model we trained. For it to work, the dataset must be placed in a folder named dataset/.
f_scores_after_strat105_500.csv is a file containing the precomputed ANOVA scores for the features of the dataset, to avoid recomputing them on every run, which is slow.
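
The two metrics named for score.py can be sketched as follows. This is a minimal plain-Python illustration, not the code in score.py: the signatures, the `positive` parameter and the choice of 1 as the positive label are assumptions made for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class.

    `positive` is an assumed parameter; it names the label treated as positive.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if 2 * tp + fp + fn == 0:
        return 0.0  # no positive labels and no positive predictions
    # Equivalent to 2 * precision * recall / (precision + recall)
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels: one true positive, one false positive, one false negative
y_true = [1, -1, -1, -1, 1, -1]
y_pred = [1, -1, 1, -1, -1, -1]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

On an unbalanced dataset the accuracy can stay high even when most positives are missed, which is why the F1 score is the more informative of the two here.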
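
The ANOVA-based feature ranking described for preprocessing.py can be illustrated with a one-way ANOVA F statistic computed per feature. The sketch below is written in plain Python under our own assumptions: `anova_f_score` and `rank_features` are hypothetical names, and anova_selection.py may compute or use the scores differently.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across label groups."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand_mean = sum(values) / n
    ss_between = 0.0  # variation of the group means around the grand mean
    ss_within = 0.0   # variation of the samples around their own group mean
    for g in groups.values():
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((v - mean_g) ** 2 for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(columns, labels):
    """Return feature indices sorted from highest to lowest F score."""
    scores = [anova_f_score(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


labels = [0, 0, 0, 1, 1, 1]
noisy = [1, 8, 3, 7, 2, 9]        # values overlap between the two classes
separating = [1, 2, 3, 7, 8, 9]   # values split cleanly by class
print(rank_features([noisy, separating], labels))
```

A feature whose class means differ a lot relative to the spread inside each class gets a high score, so the second column is ranked first in this toy example.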
The best submission on AICrowd is submission #243500, which is reproduced by run.py.
We also got an F1 of 0.437 in submission #240074 on AICrowd. Unfortunately, the algorithm was not seeded, so that result is difficult to reproduce.
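
To show why seeding matters for any step that draws random samples, here is a minimal sketch of a seeded random undersampling. It is an illustration only: `undersample`, its parameters and the 1:1 class ratio are assumptions for the example and do not describe what OverUnderSampling.py does.

```python
import random


def undersample(labels, majority_label, seed):
    """Return sorted indices that keep every minority sample and an
    equally sized random subset of the majority samples.

    Using the same seed always selects the same subset.
    """
    rng = random.Random(seed)  # local generator, independent of global state
    majority = [i for i, lab in enumerate(labels) if lab == majority_label]
    minority = [i for i, lab in enumerate(labels) if lab != majority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return sorted(minority + kept)


labels = [-1] * 20 + [1] * 4
first = undersample(labels, majority_label=-1, seed=0)
second = undersample(labels, majority_label=-1, seed=0)
print(first == second)  # identical selection on both calls
```

Passing an explicit seed through every randomized step (sampling, shuffling, initialization) is what makes a training run repeatable from one execution to the next.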