A 42 school project, from the machine learning / artificial intelligence branch.
A project to build a neural network from scratch. The goal is to train a model, then predict whether a cell is malignant or benign (using the WDBC dataset; more information inside `data_infos.txt`).
There is only one program, `python3 multilayer_perceptron.py`, with multiple options.
Unless specified otherwise, all options are compatible with each other.
`python3 multilayer_perceptron.py --dataset path_to_data`
Option to specify the path of the data on which the model will be trained. The dataset path must point to a valid `.csv` file with correct permissions. Note that it needs to follow the same format as the default `data.csv` (presence of an id as index, labels either 'M' or 'B', same number of numerical features) to be functional.
`python3 multilayer_perceptron.py --seed number`
`number` must be a valid integer. Randomness is used to initialize each neural layer. Using this option allows the user to get reproducible results (same weights if the seed is kept). By default, no seed is set.
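A minimal sketch of why seeding gives reproducible weights, assuming NumPy-based initialization (the function name and He-style scaling are illustrative, not the project's exact code):

```python
import numpy as np

def init_layer_weights(n_in, n_out, seed=None):
    """Hypothetical weight initialization for one dense layer."""
    rng = np.random.default_rng(seed)  # same seed -> same random stream
    W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)  # He-style scaling, common with ReLU
    b = np.zeros((1, n_out))
    return W, b

# Two runs with the same seed produce identical weights:
W1, _ = init_layer_weights(30, 15, seed=42)
W2, _ = init_layer_weights(30, 15, seed=42)
assert np.allclose(W1, W2)
```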
`python3 multilayer_perceptron.py --metrics accuracy precision recall f1`
By default, only the categorical cross-entropy loss is displayed (both in the terminal and in a final chart) and kept track of. The user can choose to display more metrics (in the terminal and the final chart) with this option. One or a combination of `accuracy`, `precision`, `recall` or `f1` (for F1 score) can be used.
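For reference, these four metrics can all be computed from the confusion-matrix counts. A hedged sketch (not the project's exact code), treating 'M' mapped to 1 as the positive class:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Hypothetical sketch of the four optional metrics for binary labels
    (1 = malignant 'M', 0 = benign 'B')."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```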
`python3 multilayer_perceptron.py --trainset_only`
By default, the dataset is split into a training set (80% of the dataset, on which the model is trained) and a validation set (20%, not used for training, to check whether the model generalizes well) and metrics are displayed for both. If this option is set, the split does not happen and the model is trained on the whole dataset. Not compatible with the `--stop` and `--show_only` options.
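A hedged sketch of what the default 80/20 split could look like (function and variable names are assumptions, not the project's exact code):

```python
import numpy as np

def train_val_split(X, y, val_ratio=0.2, seed=None):
    """Hypothetical 80/20 split into training and validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))           # shuffle before splitting
    cut = int(len(X) * (1.0 - val_ratio))   # 80% mark
    train_idx, val_idx = idx[:cut], idx[cut:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```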
`python3 multilayer_perceptron.py --epochs number`
`number` must be a valid positive integer. Number of iterations/epochs for which the model is trained. An epoch comprises a forward propagation and a backward propagation (with weight update). By default, 10000 epochs are performed.
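To make the "epoch = forward pass + backward pass + update" loop concrete, here is a minimal runnable toy (a single softmax layer on random data, not the project's network):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))                    # 100 samples, 4 features
y = (X[:, 0] > 0).astype(int)                        # toy binary labels
Y = np.eye(2)[y]                                     # one-hot encoding
W, b, alpha = rng.standard_normal((4, 2)) * 0.01, np.zeros(2), 0.1

for epoch in range(100):
    logits = X @ W + b                               # forward propagation
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                # softmax output
    loss = -np.mean(np.sum(Y * np.log(p + 1e-12), axis=1))  # cross-entropy
    dlogits = (p - Y) / len(X)                       # backward propagation
    W -= alpha * (X.T @ dlogits)                     # weight update
    b -= alpha * dlogits.sum(axis=0)
```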
`python3 multilayer_perceptron.py --dim number_1 ... number_n`
`number_1 ... number_n` must be valid positive integers, separated by spaces. Option to set the hidden layer dimensions of the model. By default, an architecture with 2 hidden layers of respectively 15 and 8 units is used. Note that all those hidden layers will use a ReLU activation function.
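As an illustration, the `--dim` values slot between the input and output sizes (the 30-feature input count comes from the WDBC data; the list layout is an assumption):

```python
# Hypothetical sketch: turning --dim values into a layer-size list.
dims = [15, 8]                        # default: two hidden layers of 15 and 8 units
layer_sizes = [30] + dims + [2]       # input features, hidden layers, output classes
# -> [30, 15, 8, 2]; hidden layers use ReLU, the output layer uses softmax
```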
`python3 multilayer_perceptron.py --opti optimization_algorithm`
`optimization_algorithm` must be either `rmsprop`, `adam` or `momentum`. Each option corresponds to the equivalent algorithm and is applied to the whole neural network. By default, no optimization algorithm is used.
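For reference, hedged sketches of the three update rules (hyperparameter names and defaults are common textbook choices, not necessarily those used by the program):

```python
import numpy as np

def momentum_step(w, dw, v, alpha, beta=0.9):
    v = beta * v + (1 - beta) * dw              # running average of gradients
    return w - alpha * v, v

def rmsprop_step(w, dw, s, alpha, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * dw**2           # running average of squared gradients
    return w - alpha * dw / (np.sqrt(s) + eps), s

def adam_step(w, dw, v, s, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the step counter, starting at 1 (needed for bias correction)
    v = beta1 * v + (1 - beta1) * dw            # first moment
    s = beta2 * s + (1 - beta2) * dw**2         # second moment
    v_hat = v / (1 - beta1**t)                  # bias-corrected moments
    s_hat = s / (1 - beta2**t)
    return w - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s
```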
`python3 multilayer_perceptron.py --alpha number`
`number` must be a valid positive float between 0 and 1. Choice of the learning rate hyperparameter (the value by which weights are proportionally updated). By default, alpha is set to 0.001.
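A toy illustration of the role of `alpha` in a plain gradient-descent step (values are made up):

```python
import numpy as np

w = np.array([0.5, -0.3])
dw = np.array([0.2, 0.1])      # gradient of the loss with respect to w
alpha = 0.001
w = w - alpha * dw             # small alpha -> small, stable steps
```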
`python3 multilayer_perceptron.py --name name`
`name` must be a string. Upon training, models are saved in pickle format. By default, the name is `Model` and a corresponding `Model.pkl` is created. This can be modified with this option. Note that if used with the `--predict` option, `--name` will correspond to the `.pkl` file from which the trained model will be retrieved.
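A hedged sketch of the pickle round trip behind `--name` (helper names are hypothetical):

```python
import pickle

def save_model(model, name="Model"):
    """Serialize a trained model to <name>.pkl."""
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(model, f)

def load_model(name="Model"):
    """Retrieve a previously trained model from <name>.pkl."""
    with open(f"{name}.pkl", "rb") as f:
        return pickle.load(f)
```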
`python3 multilayer_perceptron.py --stop number`
`number` must be a valid positive integer. This option implements early stopping. The validation set loss is compared with its previous best (lowest) value from at least `number` epochs back. If the current value has increased over that best, it may be because of a phenomenon called overfitting (the model can't generalize well). The model goes back to the previous state where the smallest validation loss was met and training stops. Not compatible with the `--trainset_only` option.
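A hedged sketch of the early-stopping logic, with a toy loss curve standing in for the real validation loss (names are assumptions, not the project's code):

```python
val_losses = [0.9, 0.7, 0.5, 0.45, 0.46, 0.48, 0.50, 0.55]  # starts overfitting
patience = 3                       # the --stop number
best_loss, best_epoch, waited = float("inf"), -1, 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, waited = loss, epoch, 0
    else:
        waited += 1
        if waited >= patience:     # no improvement for `patience` epochs
            print(f"stopping; rolling back to epoch {best_epoch}")
            break
```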
`python3 multilayer_perceptron.py --lambda_ number`
`number` must be a valid positive float between 0 and 1. Regularization is a way to prevent overfitting (the model not being able to generalize to data it wasn't trained on). If regularization is performed, its factor (extent) is set by this option. By default, L2 regularization is performed and `lambda_` is set to 0.5.
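For reference, a hedged sketch of how an L2 penalty typically enters the loss and the gradients (the 1/n scaling is an assumption):

```python
import numpy as np

def l2_penalty(weights_list, lambda_, n_samples):
    """Extra term added to the loss: sum of squared weights, scaled by lambda_."""
    return (lambda_ / (2 * n_samples)) * sum(np.sum(w**2) for w in weights_list)

def l2_gradient(w, lambda_, n_samples):
    """Extra term added to each weight gradient during backpropagation."""
    return (lambda_ / n_samples) * w
```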
`python3 multilayer_perceptron.py --show_all`
By default, after training, a final chart of the displayed metrics' evolution is shown. If this option is set, no training happens and the metrics evolution of all previously trained models (saved in `experiments.csv`) is displayed on a chart. Nothing is displayed if `experiments.csv` is not present, has incorrect permissions or is not in the expected format. This option may raise some Python exceptions. Not compatible with the `--show_only` option.
`python3 multilayer_perceptron.py --show_only`
By default, after training, a final chart of the displayed metrics' evolution is shown. If this option is set, an additional chart of the metrics evolution of all previously trained models (saved in `experiments.csv`) is displayed. Nothing is displayed if `experiments.csv` is not present, has incorrect permissions or is not in the expected format. This option may raise some Python exceptions. Not compatible with the `--show_all` option.
`python3 multilayer_perceptron.py --predict path_to_data`
By default, the program performs training. If this option is set, it performs prediction instead, on the dataset indicated by `path_to_data`. The dataset path must point to a valid `.csv` file with correct permissions. Note that it needs to follow the same format as the default `data.csv` (presence of an id as index, labels either 'M' or 'B', same number of numerical features) to be functional. Predicted labels are displayed, as well as the loss and the other metrics (accuracy, precision, recall, F1 score). Not compatible with any option except `--name`.
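As an illustration of the prediction step, each softmax output row can be mapped back to a label; the column-to-label order here is an assumption:

```python
import numpy as np

probs = np.array([[0.92, 0.08],
                  [0.25, 0.75]])              # softmax output, one row per sample
labels = np.array(["B", "M"])[probs.argmax(axis=1)]  # most probable class per row
print(labels)                                 # ['B' 'M']
```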
`python3 multilayer_perceptron.py --reset`
If set, deletes the `experiments.csv` file and all models saved in `.pkl` files at program launch.
`python3 multilayer_perceptron.py --reg regularization`
`regularization` must be either `l2` or `None`. By default, set to `l2`, which makes the model perform regularization (a way to limit a phenomenon called overfitting, where the model can't generalize well on data it wasn't trained on). Must be used with a `--lambda_` value different from 0.0. If set to `None`, no regularization is performed.
`evaluation.py` is a program that randomly generates a `data_training.csv` and a `data_test.csv` from a remote dataset hosted on the 42 website. This set is the same as `data.csv`.
During evaluation, this script was launched multiple times and the loss was compared against a point scale.
Note that model performance can be affected by the randomness of the generated testing set. That's why multiple predictions on different testing sets are performed.
I chose to make a "modular" implementation of the neural network.
A `SimpleNeuralNetwork` class is present, which is a multilayer perceptron implementation containing a variable number of `DenseLayer` elements.
`DenseLayer` is defined by multiple parameters (feel free to look at its implementation), including its activation function. In the `multilayer_perceptron.py` program, the activation function of all hidden layers defaults to ReLU and the output layer activation to softmax, but those parameters can be modified at `DenseLayer` initialization. The algorithms for forward and backward propagation, as well as the weight update, are present both in the `DenseLayer` object (for one layer) and in the `SimpleNeuralNetwork` object (for all layers, calling the methods of its layers), as sketched below.
Though only a subset of those objects' parameters is modified in the main program, a lot of other options are available. Look it up!
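A hedged sketch of this layout (class signatures and attribute names are assumptions, and only the forward pass is shown; backward and update follow the same per-layer pattern):

```python
import numpy as np

class DenseLayer:
    """One fully-connected layer owning its weights and activation."""
    def __init__(self, n_in, n_out, activation="relu", seed=None):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
        self.b = np.zeros((1, n_out))
        self.activation = activation

    def forward(self, X):
        self.X = X                              # cached for backpropagation
        z = X @ self.W + self.b
        if self.activation == "relu":
            return np.maximum(0, z)
        # softmax, typically for the output layer
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

class SimpleNeuralNetwork:
    """Multilayer perceptron chaining the per-layer methods."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, X):
        for layer in self.layers:               # delegates to each DenseLayer
            X = layer.forward(X)
        return X
```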
Python. Why? Because it's the main language used in data science and machine learning nowadays.
- NumPy (version 1.21.5)
- pandas (version 1.5.0)
- matplotlib (version 3.5.1)
- argparse (version 1.1)
- pickle (version 4.0)
Note that the metrics results were compared with equivalent TensorFlow functions.