A 42 school project, from the machine learning / artificial intelligence branch.
A quick project to carry out a One-vs-All logistic regression on a dataset containing Hogwarts students. Statistical analysis and visualization of the data must be performed prior to model training. The goal is to create a Sorting Hat model that uses students' marks and other characteristics to assign them to one of the four houses.
python3 dslr_interactive.py
A terminal interactive scenario involving the famous headmistress, McGonagall, which runs all the programs presented in this file (except for logreg_finetune.py). The user can choose to explain what is going on to the headmistress, which will display partial explanations for each program.
Project screen capture of interactive mode
python3 describe.py dataset_filepath
A program which takes as argument the filepath of a dataset to analyze and outputs the corresponding statistical information. The goal is to reproduce the describe function of the pandas library. Statistical information displayed: count, mean, standard deviation, minimum, first quartile, median, third quartile, maximum, mode, range, interquartile range and number of outliers.
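To illustrate what describe.py computes, here is a minimal sketch of the statistics for a single numeric column. It uses NumPy helpers for brevity; the real describe.py likely reimplements them by hand and its output format may differ.

```python
import numpy as np

def describe_column(values):
    """Descriptive statistics for one numeric column (NaN values are ignored)."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    outliers = np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))
    uniques, counts = np.unique(x, return_counts=True)
    return {
        "count": x.size,
        "mean": x.mean(),
        "std": x.std(ddof=1),            # sample standard deviation, like pandas
        "min": x[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": x[-1],
        "mode": uniques[np.argmax(counts)],
        "range": x[-1] - x[0],
        "IQR": iqr,
        "outliers": int(outliers),
    }
```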
python3 histogram.py
python3 histogram.py -expl
A program to display a histogram of marks distribution for each subject and each Hogwarts house.
The goal was to answer the question "Which subject at Hogwarts is homogeneously distributed between the four houses?".
With the -expl argument, explanations from the interactive scenario are displayed.
Project screen shot
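For reference, a minimal sketch of how one subject's marks per house can be plotted with matplotlib. The dataset path, the "Hogwarts House" column name and the chosen subject are assumptions, not necessarily what histogram.py does internally.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset_train.csv")      # assumed dataset path
subject = "Care of Magical Creatures"      # example subject, not necessarily the answer
for house, group in df.groupby("Hogwarts House"):
    plt.hist(group[subject].dropna(), bins=20, alpha=0.5, label=house)
plt.title(f"{subject} marks per house")
plt.xlabel("Marks")
plt.ylabel("Number of students")
plt.legend()
plt.show()
```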
python3 scatter_plot.py
A program to display a scatter plot of marks distribution for each pair of subjects and each Hogwarts house.
The goal was to find the two most similar subjects.
With the -expl argument, explanations from the interactive scenario are displayed.
Project screen shot
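A comparable sketch for the scatter plot, assuming the same dataset layout; the two subjects shown are only an example pair to inspect.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset_train.csv")                                # assumed dataset path
x_subject, y_subject = "Astronomy", "Defense Against the Dark Arts"  # example pair of subjects
for house, group in df.groupby("Hogwarts House"):
    plt.scatter(group[x_subject], group[y_subject], s=10, alpha=0.5, label=house)
plt.xlabel(x_subject)
plt.ylabel(y_subject)
plt.legend()
plt.show()
```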
python3 pair_plot.py
A program to display a pair plot of marks distribution for all Hogwarts subjects and all Hogwarts houses.
The goal was to choose useful features to train our model on.
With the -expl argument, explanations from the interactive scenario are displayed.
Project screen shot
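A compact matplotlib-only pair plot sketch in the same spirit; only a few numeric columns are plotted to keep it short, and the column names and dataset path are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset_train.csv")                                   # assumed dataset path
subjects = df.select_dtypes("number").drop(columns=["Index"], errors="ignore").columns[:4]
n = len(subjects)
fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n))
for i, y_subject in enumerate(subjects):
    for j, x_subject in enumerate(subjects):
        ax = axes[i, j]
        for house, group in df.groupby("Hogwarts House"):
            if i == j:                                                  # diagonal: histograms
                ax.hist(group[x_subject].dropna(), bins=15, alpha=0.5, label=house)
            else:                                                       # off-diagonal: scatter plots
                ax.scatter(group[x_subject], group[y_subject], s=3, alpha=0.4, label=house)
        if i == n - 1:
            ax.set_xlabel(x_subject, fontsize=8)
        if j == 0:
            ax.set_ylabel(y_subject, fontsize=8)
axes[0, 0].legend(fontsize=6)
plt.tight_layout()
plt.show()
```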
python3 logreg_train.py dataset_filepath
A program to train our One-vs-All model on a correct Hogwarts dataset (dataset_train.csv) taken as argument.
The goal was to reach an accuracy of 98%.
The obtained weights are saved into thetas.npz.
With the -expl argument, explanations from the interactive scenario are displayed.
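To give an idea of the underlying mechanics, here is a condensed sketch of One-vs-All training with plain batch gradient descent. The feature standardization, the default hyperparameter values and the exact thetas.npz layout are assumptions, not the real logreg_train.py interface.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, houses, alpha=0.1, max_iter=1000):
    """Train one binary logistic regression per house; X is expected to be standardized."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])                # prepend a bias column
    thetas = np.zeros((len(houses), n + 1))
    for k, house in enumerate(houses):
        yk = (y == house).astype(float)                 # 1 for this house, 0 for the others
        theta = np.zeros(n + 1)                         # "zeros" initialization
        for _ in range(max_iter):
            grad = Xb.T @ (sigmoid(Xb @ theta) - yk) / m
            theta -= alpha * grad                       # plain gradient descent step
        thetas[k] = theta
    return thetas

# Saving could then look like this (the real file layout may differ):
# np.savez("thetas.npz", thetas=thetas, houses=houses)
```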
python3 logreg_predict.py dataset_filepath
A program to assign Hogwarts students to one of the four houses.
It takes a correct dataset as argument (dataset_test.csv), as well as a weights file (thetas.npz).
Predictions are saved into houses.csv.
With the -expl argument, explanations from the interactive scenario are displayed.
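Prediction then boils down to picking, for each student, the house whose classifier outputs the highest probability. A minimal sketch, assuming the same weight layout as in the training sketch above:

```python
import numpy as np

def predict_houses(X, thetas, houses):
    """Return, for each row of X, the house with the highest One-vs-All probability."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])        # same bias column as during training
    probs = 1.0 / (1.0 + np.exp(-(Xb @ thetas.T)))       # shape: (n_students, n_houses)
    return [houses[i] for i in np.argmax(probs, axis=1)]
```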
python3 logreg_finetune.py
A bonus program to help find good hyperparameters to use as default values inside logreg_train.py.
It trains the One-vs-All model with randomly chosen hyperparameters and outputs the mean accuracy of ten different trainings (with potentially different randomly initialized weights) for those hyperparameters.
A total of 100 experiments are run on the same training and testing sets. The results and hyperparameter values of those experiments are recorded in experiments.csv.
At the end of the program, the median or mode of the hyperparameter values that helped reach the best accuracies is displayed.
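The random-search idea can be summarized as follows. The hyperparameter ranges, the train_and_score helper and the CSV layout are illustrative assumptions, not the real logreg_finetune.py interface.

```python
import csv
import random
import statistics

def random_search(train_and_score, n_experiments=100, runs_per_experiment=10):
    """train_and_score(hp) is assumed to train the model once and return a test accuracy."""
    rows = []
    for _ in range(n_experiments):
        hp = {
            "alpha": 10 ** random.uniform(-3, 0),              # illustrative ranges
            "batch_size": random.choice([1, 32, 64, 1600]),
            "lambda_": random.choice([0.0, 0.01, 0.1]),
        }
        accuracies = [train_and_score(hp) for _ in range(runs_per_experiment)]
        rows.append({**hp, "mean_accuracy": statistics.mean(accuracies)})
    with open("experiments.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    best = sorted(rows, key=lambda r: r["mean_accuracy"], reverse=True)[:10]
    print("median alpha among the best experiments:",
          statistics.median(r["alpha"] for r in best))
```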
A lot of the available options make more sense for neural networks than for logistic regression. They can still help reach optimal parameters faster, reduce overfitting, etc., but they were mainly implemented to understand their logic. A sketch of how they combine into a single update step is given after the hyperparameter list below.
- RMSprop (more neural network oriented)
- Momentum (more neural network oriented)
- Adam (more neural network oriented)
- Learning rate decay
- Mini-batch or stochastic gradient descent
- L2 regularization (as in ridge regression)
- Early stopping
Weight initialization methods implemented:
- Zeros
- Random small numbers (0-1)
- He initialization
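A short sketch of these three initialization schemes; the function name and the rng handling are illustrative.

```python
import numpy as np

def init_weights(n_features, method="he", rng=None):
    """Return an initial weight vector using one of the listed schemes."""
    if rng is None:
        rng = np.random.default_rng()
    if method == "zeros":
        return np.zeros(n_features)
    if method == "random":
        return rng.random(n_features)                                    # uniform values in [0, 1)
    if method == "he":
        return rng.normal(0.0, np.sqrt(2.0 / n_features), n_features)    # He initialization
    raise ValueError(f"unknown initialization method: {method}")
```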
Tunable hyperparameters:
- Max iter = Number of epochs
- Alpha = Learning rate
- Beta 1 = Value for momentum / Adam (more neural network oriented)
- Beta 2 = Value for RMSprop / Adam (more neural network oriented)
- Lambda = Value for L2 regularization
- Decay = Learning rate decay
- Decay interval = Interval to perform learning rate decay
- Epsilon = Small number to avoid division by zero in RMSprop / Adam
- Batch size = Size of a batch (1 = stochastic, between 1 and dataset size = mini-batch, full dataset = batch)
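As mentioned above, here is a compressed sketch of how these hyperparameters typically enter a single update step: Adam combining momentum and RMSprop, plus L2 regularization and step-wise learning-rate decay. The exact way logreg_train.py and logreg_finetune.py combine them may differ.

```python
import numpy as np

def adam_step(theta, grad, state, t, alpha=0.01, beta1=0.9, beta2=0.999,
              lambda_=0.0, epsilon=1e-8, decay=0.0, decay_interval=100, n_samples=1):
    """One parameter update mixing the options listed above; t is the 1-based iteration count."""
    grad = grad + (lambda_ / n_samples) * theta                  # L2 regularization term
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # momentum (first moment)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2    # RMSprop (second moment)
    m_hat = state["m"] / (1 - beta1 ** t)                        # bias correction
    v_hat = state["v"] / (1 - beta2 ** t)
    lr = alpha / (1 + decay * (t // decay_interval))             # step-wise learning rate decay
    return theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)

# state starts as {"m": np.zeros_like(theta), "v": np.zeros_like(theta)}.
# The batch size only changes which rows the gradient is computed on
# (1 = stochastic, a subset = mini-batch, the whole set = batch gradient descent).
```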
Python 🐍 Why? Because it is the main language used in data science and machine learning nowadays.
- NumPy (version 1.21.5)
- pandas (version 1.5.0)
- matplotlib (version 3.5.1)
- scikit-learn (version 1.1.2)
- playsound (version 1.3.0)
- argparse (version 1.1)
Note that seaborn could have been used for pair_plot.py but, as 42 restricts the amount of memory available per student, I chose to use solely matplotlib for visualization.