/dslr-42

Data science project. Logistic regression from scratch. Machine learning to mimic Harry Potter's Sorting Hat and predict Hogwarts students' houses with 99% accuracy.

Primary language: Python. License: BSD 2-Clause "Simplified" (BSD-2-Clause).

dslr-42

Datascience X Logistic Regression

ML sorting hat

Intro

DataScience x Logistic Regression (dslr) is a 42 school project on the data branch of the Holygraph. As an introduction to machine learning, it consists of training a multivariate logistic regression model to solve a classification problem. This is our solution to the dslr subject, done by a team of two, in Python. For this project, 42 forbids Python functions or modules that do most of the statistics or machine learning work, such as scikit-learn.

The main objective is to recreate a ✨ magic Sorting Hat 🎓 ✨ to predict Hogwarts student houses.
When [Harry Potter's universe](https://www.wizardingworld.com/) meets a Data scientist.

Data understanding

The training dataset describes 1600 students 🧙 through 17 features:

  • Four biographic features: First Name, Last Name, Birthday, Best Hand.

  • A set of 13 wizard skill features: Arithmancy, Astronomy, Herbology, Defense Against the Dark Arts, Divination, Muggle Studies, Ancient Runes, History of Magic, Transfiguration, Potions, Care of Magical Creatures, Charms, Flying.

A model is trained on a selected subset of these features so that it can predict a student's affiliation to one of the four 🏰 Hogwarts houses:

🦅 Gryffindor

🦡 Hufflepuff

🐦‍⬛ Ravenclaw

🐍 Slytherin

The target accuracy on the testing dataset is above 98%.


Usage

config file

In the dslr module, a config.py file holds the configuration used by the scripts: directory and file names, the features used for training, and the regression parameters.
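As a rough illustration of what such a configuration file might contain, here is a minimal sketch. The paths are taken from the commands shown in this README, but every variable name, the feature subset and the parameter values are hypothetical and do not reflect the project's actual config.py.

```python
# Hypothetical config.py layout: all names and values are illustrative only.
from pathlib import Path

# Directories and file names (paths as used in the commands of this README)
DATASETS_DIR = Path("datasets")
TRAIN_FILE = DATASETS_DIR / "dataset_train.csv"
TEST_FILE = DATASETS_DIR / "dataset_test.csv"
WEIGHTS_FILE = Path("logreg_model") / "weights.csv"

# Features used for training (an arbitrary subset, not the authors' selection)
FEATURES = ["Astronomy", "Herbology", "Ancient Runes"]

# Regression parameters (placeholder values)
LEARNING_RATE = 0.1
EPOCHS = 1000
```

Keeping these values in one module lets every script import the same paths and parameters instead of repeating them.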

Virtual environment (venv)

Run make to install the virtual environment with its requirements.

To activate the virtual environment venv:

source venv/bin/activate

Entry point scripts

Entry point scripts are at the root of the project. Each one runs the corresponding script in a subshell with os.system.
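The sketch below shows one way an entry point could forward its arguments to a script through os.system. The helper names build_command and run_script are hypothetical, and the quoting relies on shlex.quote, which targets POSIX shells. The project's own entry points may be written differently.

```python
import os
import shlex
import sys


def build_command(script_path, args):
    """Build a shell command that runs script_path with the current interpreter."""
    parts = [sys.executable, script_path, *args]
    # Quote every part so paths containing spaces survive the subshell.
    return " ".join(shlex.quote(part) for part in parts)


def run_script(script_path, args):
    """Run the script in a subshell and return the raw status from os.system."""
    return os.system(build_command(script_path, args))
```

A root-level describe.py could then call run_script("dslr/describe.py", sys.argv[1:]).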

| Program | Arguments | Action |
|---|---|---|
| describe.py | [dataset] | Describes a dataset with statistics |
| logreg_train.py | [dataset] | Trains the logistic regression model from a training dataset |
| logreg_test.py | [dataset] [weights] | Tests the logistic regression model with a testing dataset |
| histogram.py | [dataset] [feature] | Plots a histogram for a given dataset feature |
| scatter_plot.py | [dataset] [feature_1] [feature_2] | Plots a scatter plot for 2 given features |
| pair_plot.py | [dataset] | Plots a triangle matrix of scatter plots and distributions for all dataset features |

Describing the dataset :

python describe.py ./datasets/dataset_train.csv

The training script takes a dataset file as argument :

python ./dslr/logreg_train.py ./datasets/dataset_train.csv

For prediction :

python ./logreg_predict.py ./datasets/dataset_test.csv ./logreg_model/weights.csv

Dataset scripts

describe.py

describe.py mimics the describe() function of the pandas library. A data file must be provided as an argument.

Describing the training dataset :

python ./dslr/describe.py datasets/dataset_train.csv

Output:

         Index Arithmancy Astronomy  ... Care of Magical Creatures   Charms   Flying
count  1600.00    1566.00   1568.00  ...                   1560.00  1600.00  1600.00
mean    799.50   49634.57     39.80  ...                     -0.05  -243.37    21.96
std     462.02   16679.81    520.30  ...                      0.97     8.78    97.63
min       0.00  -24370.00   -966.74  ...                     -3.31  -261.05  -181.47
25%     399.75   38511.50   -489.55  ...                     -0.67  -250.65   -41.87
50%     799.50   49013.50    260.29  ...                     -0.04  -244.87    -2.51
75%    1199.25   60811.25    524.77  ...                      0.59  -232.55    50.56
max    1599.00  104956.00   1016.21  ...                      3.06  -225.43   279.07
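The statistics in this output can be computed without library helpers. The sketch below is one possible approach, not the project's DescriberClass: it assumes missing values are represented as None, uses the sample standard deviation (n - 1 in the denominator) and linear interpolation for percentiles. Both choices reproduce the Index column above. The function names are hypothetical.

```python
import math


def percentile(sorted_data, n, q):
    """Linear-interpolation percentile (q in [0, 1]) on already sorted data."""
    position = q * (n - 1)
    lower = int(position)  # floor, since position is never negative
    upper = lower + 1 if lower + 1 < n else lower
    fraction = position - lower
    return sorted_data[lower] + (sorted_data[upper] - sorted_data[lower]) * fraction


def describe_column(values):
    """Return count, mean, std, min, quartiles and max, ignoring None values."""
    data = sorted(v for v in values if v is not None)
    n = 0
    total = 0.0
    for v in data:  # manual count and sum
        n += 1
        total += v
    if n == 0:
        return {"count": 0}
    mean = total / n
    squared = 0.0
    for v in data:
        squared += (v - mean) ** 2
    std = math.sqrt(squared / (n - 1)) if n > 1 else float("nan")
    return {
        "count": n,
        "mean": mean,
        "std": std,
        "min": data[0],
        "25%": percentile(data, n, 0.25),
        "50%": percentile(data, n, 0.50),
        "75%": percentile(data, n, 0.75),
        "max": data[n - 1],
    }


# The Index column of the training set is simply 0..1599.
stats = describe_column(range(1600))
for name, value in stats.items():
    print(f"{name:<6}{value:>10.2f}")
```

Sorting once up front makes min, max and every percentile a simple index lookup.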

Plots

Required plots :

  • Histogram

python ./histogram.py ./datasets/dataset_train.csv

  • Scatter Plot

python scatter_plot.py ./datasets/dataset_train.csv

  • Pair plot Matrix

python pair_plot.py ./datasets/dataset_train.csv

Scatter plot matrix visualization of student features. As preliminary work, we investigated the relationships between variables taken two at a time. From there, we selected the features best suited to train our model.

Pair plot

Additional plots

  • Box plot (-b option)

python dslr/plot_dataset.py -b ./datasets/dataset_train.csv

  • Joint plot (-j option): a joint plot combines a scatter plot with density distributions.

python dslr/plot_dataset.py -j ./datasets/dataset_train.csv

Notebooks

Other plots in notebooks :

multi-box plots and many heatmaps

Subject

42 dslr subject

Mandatory part

Describe from scratch

A describe.py program that describes the dataset and behaves like the pandas describe() function. It is forbidden to use any function that does the job for you, such as count, mean, std, min, max, percentile, etc.

Logistic regression training

Multi-class classifier using one-vs-all logistic regression:

logreg_train.[extension] dataset_train.csv

Gradient descent algorithm to minimize the error.

Generates a file containing the model weights.
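To make the training step concrete, here is a minimal pure-Python sketch of one-vs-all logistic regression trained with batch gradient descent. It is an illustration under assumptions, not the project's code: the function names, learning rate, epoch count and toy data are made up, and the real features would first need scaling and missing-value handling.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def probability(weights, x):
    """Model output for one sample; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def train_binary(rows, targets, learning_rate=0.1, epochs=300):
    """Batch gradient descent on the mean log-loss of one binary classifier."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, targets):
            error = probability(weights, x) - y
            gradient[0] += error
            for j, xi in enumerate(x, start=1):
                gradient[j] += error * xi
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


def train_one_vs_all(rows, labels, **kwargs):
    """Train one 'this class vs. the rest' classifier per distinct label."""
    return {
        cls: train_binary(rows, [1 if label == cls else 0 for label in labels], **kwargs)
        for cls in sorted(set(labels))
    }


def predict(models, x):
    """Return the label whose classifier assigns the highest probability."""
    return max(models, key=lambda cls: probability(models[cls], x))


# Toy, already scaled data: one cluster per house (made up for illustration).
rows = [[2.0, 2.0], [2.2, 1.8], [-2.0, 2.0], [-1.8, 2.2],
        [-2.0, -2.0], [-2.1, -1.9], [2.0, -2.0], [1.9, -2.2]]
labels = ["Gryffindor", "Gryffindor", "Hufflepuff", "Hufflepuff",
          "Ravenclaw", "Ravenclaw", "Slytherin", "Slytherin"]
models = train_one_vs_all(rows, labels)
print(predict(models, [3.0, 3.0]))
```

The dictionary returned by train_one_vs_all holds one weight vector per house, which is the kind of content a weights file would need to store.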

Usage :

logreg_predict.[extension] dataset_test.csv [weights]

Prediction

Predict from './datasets/dataset_test.csv' and generate a prediction file `houses.csv` formatted exactly as follows:

$> cat houses.csv
Index,Hogwarts House
0,Gryffindor
1,Hufflepuff
2,Ravenclaw
[...] 
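A file in this format can be produced with the standard csv module. The sketch below is an assumption about how the output step could look. The function name write_predictions is hypothetical, and the example writes to an in-memory buffer rather than to disk.

```python
import csv
import io


def write_predictions(predictions, stream):
    """Write 'Index,Hogwarts House' rows to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["Index", "Hogwarts House"])
    for index, house in enumerate(predictions):
        writer.writerow([index, house])


buffer = io.StringIO()
write_predictions(["Gryffindor", "Hufflepuff", "Ravenclaw"], buffer)
print(buffer.getvalue(), end="")
```

To write the real file, pass a handle opened with open("houses.csv", "w", newline="") instead of the buffer.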

Bonus

  • Add more fields for describe.py

  • Implement a stochastic gradient descent

  • Implement other optimization algorithms (Batch GD/mini-batch GD/ you name
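Stochastic and mini-batch gradient descent differ from batch gradient descent mainly in how many samples feed each weight update. The sketch below shows only that batching step, under assumptions. The function name iterate_minibatches and its seed parameter are hypothetical, and the gradient update itself is left to the training loop.

```python
import random


def iterate_minibatches(n_samples, batch_size, seed=0):
    """Yield lists of row indices covering every sample once, in shuffled order."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


# batch_size=1 gives stochastic GD, batch_size=n_samples gives batch GD.
batches = list(iterate_minibatches(n_samples=10, batch_size=4))
print([len(batch) for batch in batches])
```

Each epoch would call the generator again, ideally with a different seed, and apply one gradient step per yielded batch.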

Peer-Evaluation

Answers will be evaluated using the accuracy score of the scikit-learn library. Professor McGonagall agrees that your algorithm is comparable to the Sorting Hat only if it has a minimum precision of 98%.
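For plain label lists, an accuracy score is the fraction of predictions that match the true labels. The sketch below computes it without scikit-learn so a result can be checked locally. The function name accuracy is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


truth = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
guess = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Gryffindor"]
print(accuracy(truth, guess))
```

With the 98% threshold, a run passes when the returned value is at least 0.98.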

Some helpful links

Jupyter

Jupyter notebooks were used for dataset exploration.

Directory Structure

The project directory structure follows these guidelines:

The Hitchhiker's Guide to Python - Structuring Your Project

CookieCutter utility

How To Structure a Data Science Project: A Step-by-Step Guide

Virtual environment

A Python virtual environment is installed and set up so that this project is self-contained, isolated from the system Python and from other projects' virtual environments. The virtual environment has its own Python interpreter and its own dependencies: third-party libraries installed from the specifications in the requirement.txt file. This avoids system pollution and dependency conflicts, and improves reproducibility for a data science project. We used the virtualenv tool for dependency management and project isolation. Instead of a bash script, we chose a Makefile for generic management tasks, for its capabilities and readability.

Makefile and entrypoint

Makefile: the secret weapon for ML project management

Makefile - Make for Data Science

setup.py script (in French)

setup.py

A setup.py file is a standard way in Python to specify how your project should be installed, packaged, and distributed. This file is used by tools like setuptools and pip to manage the installation process. The setup() function within setup.py defines metadata about your project, such as its name, version, dependencies, and other details. Run python setup.py install to install your project locally.

setuptools

Having a setup.py becomes especially important if you plan to distribute your code as a Python package, whether through the Python Package Index (PyPI) or other distribution channels. It helps other developers easily install and use your project and allows tools like pip to manage dependencies.

format, width, precision

Precision

pandas dataframes and np.arrays

subsetting data, NumPy hierarchy

for the describe.py part

Argument parser = argparse, Exceptions, Pandas describe doc, numpy statistics, numpy percentile, Math and statistics online calculator, skewness and kurtosis

subset data

for the logistic regression part

Kaggle : logistic regression from scratch

Plots

constrained layout

testing

unittest, unittest (in French), unittest tutorial - OpenClassrooms

Tests

Test runner chosen: unittest, included in the Python standard library.

./dslr/tests/testDescribe.py compares the output of DescriberClass with pandas describe()

./dslr/tests/testUtilsMath.py compares the functions in utils.math.py with their numpy / pandas equivalents
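The pattern behind these tests is to compute a statistic with the from-scratch function and check it against a trusted reference. The sketch below illustrates that pattern under assumptions. The mean and std helpers are stand-ins written for this example, and the standard statistics module replaces numpy / pandas as the reference so that the example has no third-party dependency.

```python
import math
import statistics
import unittest


def mean(values):
    """From-scratch mean (stand-in for a project helper)."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count


def std(values):
    """From-scratch sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    squared = 0.0
    count = 0
    for v in values:
        squared += (v - m) ** 2
        count += 1
    return math.sqrt(squared / (count - 1))


class TestUtilsMath(unittest.TestCase):
    DATA = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

    def test_mean_matches_reference(self):
        self.assertAlmostEqual(mean(self.DATA), statistics.mean(self.DATA))

    def test_std_matches_reference(self):
        # statistics.stdev is also the sample standard deviation
        self.assertAlmostEqual(std(self.DATA), statistics.stdev(self.DATA))
```

Saved as a test module, this can be run with python -m unittest.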

pyinstaller module

An executable application (.exe) could be built with pyinstaller.