The DataScience x Logistic Regression (`dslr`) is a 42 school project, on the data branch of the Holygraph. As an initiation to Machine Learning, it consists of training a multivariate logistic regression model to solve a classification problem.
This is our solution to the `dslr` subject, done by a team of two, in Python.
Python functions or modules that do most of the statistics or machine learning work, such as `scikit-learn`, are forbidden by 42 for this project.
The main objective is to recreate a ✨ magic Sorting Hat 🎓 ✨ to predict Hogwarts student houses.
When [Harry Potter's universe](https://www.wizardingworld.com/) meets a Data scientist.
The training dataset describes the characteristics of 1600 students 🧙, with 17 features:
- Four biographic features: `First Name`, `Last Name`, `Birthday`, `Best Hand`.
- A set of 13 wizard skill features, referred to as: `Arithmancy`, `Astronomy`, `Herbology`, `Defense Against the Dark Arts`, `Divination`, `Muggle Studies`, `Ancient Runes`, `History of Magic`, `Transfiguration`, `Potions`, `Care of Magical Creatures`, `Charms`, `Flying`.
A model is trained on a selection of these features so that it can predict a student's affiliation to one of the four 🏰 Hogwarts houses:
- 🦅 Gryffindor
- 🦡 Hufflepuff
- 🐦⬛ Ravenclaw
- 🐍 Slytherin
The target accuracy on the testing dataset is above 98%.
In the `dslr` module, a `config.py` file contains the configuration for the scripts to be run: directory and file names, but also the features to train on and the regression parameters.
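As an illustration only, a configuration module of this kind might look like the sketch below. Every constant name and value here (`SELECTED_FEATURES`, `LEARNING_RATE`, `MAX_ITERATIONS`, and so on) is hypothetical and not taken from the project's actual `config.py`; only the dataset and weights paths echo the commands shown in this README.

```python
"""Hypothetical sketch of a config.py; names and values are illustrative only."""
from pathlib import Path

# Directories and file names (assumed layout)
DATASETS_DIR = Path("datasets")
MODEL_DIR = Path("logreg_model")
TRAIN_FILE = DATASETS_DIR / "dataset_train.csv"
TEST_FILE = DATASETS_DIR / "dataset_test.csv"
WEIGHTS_FILE = MODEL_DIR / "weights.csv"

# Column to predict and features to train on (an arbitrary example selection)
TARGET = "Hogwarts House"
SELECTED_FEATURES = ["Herbology", "Ancient Runes", "Charms"]

# Regression parameters (placeholder values, not tuned)
LEARNING_RATE = 0.1
MAX_ITERATIONS = 1000
```

Keeping these values in one module means the training, prediction and plotting scripts can share them without repeating paths or hyperparameters.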
Run `make` to install the virtual environment with its requirements.
To activate the virtual environment `venv`:
source venv/bin/activate
Entrypoint scripts are at the root of the project. They run the corresponding script from a subshell with `os.system`.
Program | Arguments | Action |
---|---|---|
`describe.py` | `[dataset]` | Describes a dataset with statistics |
`logreg_train.py` | `[dataset]` | Trains the logistic regression model from a training dataset |
`logreg_test.py` | `[dataset] [weights]` | Tests the logistic regression model with a testing dataset |
`histogram.py` | `[dataset] [feature]` | Plots a histogram for a given dataset feature |
`scatter_plot.py` | `[dataset] [feature_1] [feature_2]` | Plots a scatter plot for 2 given features |
`pair_plot.py` | `[dataset]` | Plots a triangle matrix of scatter plots and distributions for all dataset features |
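A root-level entrypoint that forwards its arguments to a script through `os.system` could look roughly like the sketch below. The helper name `build_command` and the `dslr/describe.py` target path are assumptions made for illustration; `shlex.quote` is added here so that paths containing spaces survive the subshell.

```python
import os
import shlex
import sys


def build_command(script, args):
    """Build a shell command that runs `script` with the current interpreter."""
    parts = [sys.executable, script, *args]
    return " ".join(shlex.quote(part) for part in parts)


def main():
    # Hypothetical target: the real script path may differ in the project.
    command = build_command("dslr/describe.py", sys.argv[1:])
    # os.system runs the command in a subshell and returns its status.
    status = os.system(command)
    # Report failure without decoding the platform-specific status value.
    sys.exit(0 if status == 0 else 1)
```

In a real entrypoint, `main()` would be called under the usual `if __name__ == "__main__":` guard.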
Describing the dataset:
python describe.py ./datasets/dataset_train.csv
The training script takes a dataset file as its argument:
python ./dslr/logreg_train.py ./datasets/dataset_train.csv
For prediction:
python ./logreg_predict.py ./datasets/dataset_test.csv ./logreg_model/weights.csv
`describe.py` mimics the pandas library `describe()` function.
A data file must be provided as an argument.
Describing the training dataset :
python ./dslr/describe.py datasets/dataset_train.csv
Output:
Index Arithmancy Astronomy ... Care of Magical Creatures Charms Flying
count 1600.00 1566.00 1568.00 ... 1560.00 1600.00 1600.00
mean 799.50 49634.57 39.80 ... -0.05 -243.37 21.96
std 462.02 16679.81 520.30 ... 0.97 8.78 97.63
min 0.00 -24370.00 -966.74 ... -3.31 -261.05 -181.47
25% 399.75 38511.50 -489.55 ... -0.67 -250.65 -41.87
50% 799.50 49013.50 260.29 ... -0.04 -244.87 -2.51
75% 1199.25 60811.25 524.77 ... 0.59 -232.55 50.56
max 1599.00 104956.00 1016.21 ... 3.06 -225.43 279.07
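The statistics in this output can be computed without any library helper. The sketch below is one possible way to do it, not the project's code: the function names are hypothetical, missing values are assumed to be represented as `None`, the standard deviation divides by n - 1, and percentiles use linear interpolation between the two closest ranks.

```python
import math


def _clean(values):
    """Drop missing values (assumed to be represented as None)."""
    return [v for v in values if v is not None]


def count(values):
    """Number of non-missing values."""
    total = 0
    for _ in _clean(values):
        total += 1
    return total


def mean(values):
    """Arithmetic mean of the non-missing values."""
    clean = _clean(values)
    total = 0.0
    for v in clean:
        total += v
    return total / len(clean)


def std(values):
    """Sample standard deviation (divides by n - 1)."""
    clean = _clean(values)
    m = mean(clean)
    squared = 0.0
    for v in clean:
        squared += (v - m) ** 2
    return math.sqrt(squared / (len(clean) - 1))


def percentile(values, p):
    """p-th percentile (0 to 100) with linear interpolation."""
    clean = sorted(_clean(values))
    rank = (len(clean) - 1) * p / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return clean[lower] + (clean[upper] - clean[lower]) * fraction
```

Applied to the `Index` column (the integers 0 to 1599), these functions give a mean of 799.50, a standard deviation of 462.02 and a 25% value of 399.75, matching the output above.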
Plots that are required
- Histogram
python histogram.py ./datasets/dataset_train.csv
- Scatter Plot
python scatter_plot.py ./datasets/dataset_train.csv
- Pair plot Matrix
python pair_plot.py ./datasets/dataset_train.csv
Scatter-plot matrix visualization of the students' features. As preliminary work, we investigated the relationships between variables taken two by two. From there, we selected the features best suited to train our model.
- Box plot (`-b` option)
python dslr/plot_dataset.py -b ./datasets/dataset_train.csv
- Joint plot (`-j` option)
A joint plot is a nice combination of a scatter plot and a density distribution.
python dslr/plot_dataset.py -j ./datasets/dataset_train.csv
Other plots in notebooks :
multi-box plots and many heatmaps
A `describe.py` program that describes the dataset and behaves like pandas `DataFrame.describe()`. It is forbidden to use any function that does the job for you, such as count, mean, std, min, max, percentile, etc.
A multi-class classifier using one-vs-all logistic regression.
Usage:
logreg_train.[extension] dataset_train.csv
Uses a gradient descent algorithm to minimize the error.
Generates a file containing the model weights.
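The training step described here can be sketched in plain Python as follows. This is a minimal illustration rather than the project's implementation: the function names, the default `learning_rate` and `iterations`, and the choice of full-batch updates are all assumptions, and the features are expected to be numeric and already scaled.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, row):
    """Probability of the positive class; weights[0] is the bias term."""
    z = weights[0]
    for w, x in zip(weights[1:], row):
        z += w * x
    return sigmoid(z)


def train_binary(rows, targets, learning_rate=0.1, iterations=1000):
    """Batch gradient descent on the log-loss of one binary classifier."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for row, y in zip(rows, targets):
            error = predict_proba(weights, row) - y
            gradient[0] += error
            for j, x in enumerate(row):
                gradient[j + 1] += error * x
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


def train_one_vs_all(rows, labels, **kwargs):
    """Train one binary classifier per class: that class against all others."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1.0 if label == cls else 0.0 for label in labels]
        models[cls] = train_binary(rows, targets, **kwargs)
    return models


def predict(models, row):
    """Pick the class whose classifier gives the highest probability."""
    best_class, best_proba = None, -1.0
    for cls, weights in models.items():
        proba = predict_proba(weights, row)
        if proba > best_proba:
            best_class, best_proba = cls, proba
    return best_class
```

The dictionary returned by `train_one_vs_all` holds one weight vector per house, which is the kind of content a weights file would need to store for the prediction step.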
Usage:
logreg_predict.[extension] dataset_test.csv [weights]
Predicts from `./datasets/dataset_test.csv` and generates a prediction file `houses.csv` formatted exactly as follows:
$> cat houses.csv
Index,Hogwarts House
0,Gryffindor
1,Hufflepuff
2,Ravenclaw
[...]
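A file in this format can be produced with the standard `csv` module. The sketch below is an assumption about how one might write it, with a hypothetical function name `write_predictions`; it takes an already open text stream so that the same code works for a file or an in-memory buffer.

```python
import csv


def write_predictions(predictions, stream):
    """Write predictions as 'Index,Hogwarts House' rows to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["Index", "Hogwarts House"])
    for index, house in enumerate(predictions):
        writer.writerow([index, house])
```

To create the file itself, open it with `open("houses.csv", "w", newline="")` and pass the resulting handle as `stream`.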
- Add more fields for describe.py
- Implement a stochastic gradient descent
- Implement other optimization algorithms (Batch GD/mini-batch GD/ you name
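The gradient descent variants listed here differ mainly in how many samples feed each weight update: all of them (batch), a single one (stochastic) or a small subset (mini-batch). One hedged way to support all three with the same training loop is a batch iterator such as the sketch below; the function name and parameters are hypothetical.

```python
import random


def iterate_minibatches(rows, targets, batch_size, seed=None):
    """Yield shuffled (rows, targets) mini-batches covering the data once."""
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [rows[i] for i in batch], [targets[i] for i in batch]
```

With `batch_size=1` each update uses a single sample, and with `batch_size=len(rows)` the loop falls back to ordinary batch gradient descent.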
Answers will be evaluated using the accuracy score of the Scikit-Learn library. Professor McGonagall agrees that your algorithm is comparable to the Sorting Hat only if it has a minimum precision of 98%.
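For single-label classification, the accuracy score is simply the fraction of predictions that exactly match the expected labels. Since `scikit-learn` cannot be used inside the project, a from-scratch check could look like this sketch (the function name `accuracy` is an assumption):

```python
def accuracy(expected, predicted):
    """Fraction of predictions that exactly match the expected labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot compute the accuracy of an empty sequence")
    correct = 0
    for truth, guess in zip(expected, predicted):
        if truth == guess:
            correct += 1
    return correct / len(expected)
```

A model meets the 98% threshold when this value is at least 0.98 on the evaluation data.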
Jupyter notebooks were used for dataset exploration.
The project directory structure follows these guidelines:
- The Hitchhiker's Guide to Python - Structuring Your Project
- How To Structure a Data Science Project: A Step-by-Step Guide
A Python virtual environment is installed and set up so that this project is self-contained, isolated from the system Python and from other projects' virtual environments.
The virtual environment has its own Python interpreter and its own dependencies, third-party libraries installed according to the `requirement.txt` file. It avoids system pollution and dependency conflicts, and improves the reproducibility of a data science project. We used the `virtualenv` tool for dependency management and project isolation. Instead of a `bash` script, we chose to rely on the capabilities and readability of a `Makefile` for generic management tasks.
Makefile: the secret weapon for ML project management
Makefile - Make for Data Science
A `setup.py` file is a standard way in Python to specify how your project should be installed, packaged, and distributed. This file is used by tools like `setuptools` and `pip` to manage the installation process. The `setup()` function within `setup.py` defines metadata about your project, such as its name, version, dependencies, and other details.
Run `python setup.py install` to install your project locally.
Having a setup.py becomes especially important if you plan to distribute your code as a Python package, whether through the Python Package Index (PyPI) or other distribution channels. It helps other developers easily install and use your project and allows tools like pip to manage dependencies.
- subsetting data
- Numpy hierarchy
- Argument parser = argparse
- Exceptions
- Pandas describe doc
- numpy statistics
- numpy percentile
- Math and statistics online calculator
- skewness and kurtosis
- Kaggle: logistic regression from scratch
- unittest
- unittest (in French)
- unittest tutorial - openclassrooms
Test runner chosen: `unittest`, included in the Python standard library.
- `./dslr/tests/testDescribe.py` compares `DescriberClass` with `pandas.describe()`
- `./dslr/tests/testUtilsMath.py` compares the `utils.math.py` functions with their numpy / pandas equivalents
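A comparison test of this kind could be structured as in the sketch below. It is not taken from the project's test files: the `mean` function is a hypothetical stand-in for a from-scratch utility, and the standard library's `statistics.mean` replaces the numpy / pandas reference so that the example runs without third-party packages.

```python
import statistics
import unittest


def mean(values):
    """Hypothetical stand-in for a from-scratch mean utility."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


class TestMean(unittest.TestCase):
    def test_matches_reference(self):
        # Compare the from-scratch result with a trusted implementation.
        data = [1.5, -2.0, 7.25, 0.0]
        self.assertAlmostEqual(mean(data), statistics.mean(data))

    def test_empty_input_raises(self):
        # Dividing by a length of zero is expected to fail loudly.
        with self.assertRaises(ZeroDivisionError):
            mean([])
```

Tests written this way are discovered and run by `python -m unittest`.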
An executable application (`.exe`) could be built with `pyinstaller`.