dslr-42

Datascience X Logistic Regression

Intro
Data understanding
Usage
Describe Data: describe.py
Plots
Subject
Some helpful links

Intro

DataScience x Logistic Regression (`dslr`) is a 42 school project on the data branch of the Holygraph. As an introduction to machine learning, it consists of training a multivariate logistic regression model to solve a classification problem. This is our solution to the `dslr` subject, done by a team of two in `Python`. `Python` functions or modules that do most of the statistics or machine learning work, such as `scikit-learn`, are forbidden by 42 for this project.

The main objective is to recreate a ✨ magic Sorting Hat 🎓 ✨ to predict Hogwarts student houses.
When [Harry Potter's universe](https://www.wizardingworld.com/) meets a Data scientist.

Data understanding

The training dataset describes 1600 students 🧙 with 17 features:

Four biographic features: First Name, Last Name, Birthday, Best Hand.
A set of 13 wizard skill features: Arithmancy, Astronomy, Herbology, Defense Against the Dark Arts, Divination, Muggle Studies, Ancient Runes, History of Magic, Transfiguration, Potions, Care of Magical Creatures, Charms, Flying.

A model is trained on a selected subset of these features so that it can predict which of the four 🏰 Hogwarts houses a student belongs to:

🦅 Gryffindor

🦡 Hufflepuff

🐦‍⬛ Ravenclaw

🐍 Slytherin

The target accuracy when predicting the testing dataset is above 98%.

Usage

config file

In the dslr module, the config.py file holds the configuration used by the scripts: directory and file names, the features used for training, and the regression parameters.

Virtual environment (venv)

Run `make` to install the virtual environment with its requirements.

To activate the virtual environment venv:

source venv/bin/activate

Entrypoint scripts

Entrypoint scripts are at the root of the project. They run the underlying scripts in a subshell with os.system.
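A root-level entrypoint of this kind might look like the sketch below. The script path `dslr/describe.py` is taken from the commands shown in this README; the helper names `build_command` and `run_describe`, and the use of `shlex.quote` (POSIX shells only), are assumptions made for illustration and are not the project's actual code.

```python
import os
import shlex
import sys


def build_command(script: str, args: list[str]) -> str:
    """Build a shell command that runs `script` with the current interpreter."""
    parts = [sys.executable, script, *args]
    # Quote every part so that paths containing spaces survive the subshell.
    return " ".join(shlex.quote(part) for part in parts)


def run_describe(argv: list[str]) -> int:
    """Forward the command-line arguments to the script inside the package."""
    # os.system runs the command in a subshell and returns its exit status.
    status = os.system(build_command("dslr/describe.py", argv))
    return 0 if status == 0 else 1
```

A call such as `run_describe(["./datasets/dataset_train.csv"])` would then behave like the `describe.py` entrypoint listed in the table below.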

| Program | Arguments | Action |
|---|---|---|
| `describe.py` | [dataset] | Describes a dataset with statistics |
| `logreg_train.py` | [dataset] | Trains the logistic regression model from a training dataset |
| `logreg_test.py` | [dataset] [weights] | Tests the logistic regression model with a testing dataset |
| `histogram.py` | [dataset] [feature] | Plots a histogram for a given dataset feature |
| `scatter_plot.py` | [dataset] [feature_1] [feature_2] | Plots a scatter plot for 2 given features |
| `pair_plot.py` | [dataset] | Plots a triangular matrix of scatter plots and distributions for all dataset features |

Describing the dataset:

python describe.py ./datasets/dataset_train.csv

The training script takes a dataset file as argument:

python ./dslr/logreg_train.py ./datasets/dataset_train.csv

For prediction:

python ./logreg_predict.py ./datasets/dataset_test.csv ./logreg_model/weights.csv

describe.py

describe.py mimics the describe() function of the pandas library. A data file must be provided as argument.

Describing the training dataset:

python ./dslr/describe.py datasets/dataset_train.csv

Output:

         Index Arithmancy Astronomy  ... Care of Magical Creatures   Charms   Flying
count  1600.00    1566.00   1568.00  ...                   1560.00  1600.00  1600.00
mean    799.50   49634.57     39.80  ...                     -0.05  -243.37    21.96
std     462.02   16679.81    520.30  ...                      0.97     8.78    97.63
min       0.00  -24370.00   -966.74  ...                     -3.31  -261.05  -181.47
25%     399.75   38511.50   -489.55  ...                     -0.67  -250.65   -41.87
50%     799.50   49013.50    260.29  ...                     -0.04  -244.87    -2.51
75%    1199.25   60811.25    524.77  ...                      0.59  -232.55    50.56
max    1599.00  104956.00   1016.21  ...                      3.06  -225.43   279.07

Plots

Required plots :


Histogram

python histogram.py ./datasets/dataset_train.csv

Scatter Plot

python scatter_plot.py ./datasets/dataset_train.csv

Pair plot Matrix

python pair_plot.py ./datasets/dataset_train.csv

Scatter plot matrix visualization of the students' features. As preliminary work, we investigated the relationships between variables taken two by two. From there, we selected the features best suited to train our model.

Additional plots

Box plot -b option

python dslr/plot_dataset.py -b ./datasets/dataset_train.csv

Joint plot -j option

A joint plot is a nice combination of a scatter plot and density distributions.

python dslr/plot_dataset.py -j ./datasets/dataset_train.csv

Notebooks

Other plots are available in the notebooks: multi-box plots and many heatmaps.

Subject

42 dslr subject

Mandatory part

Describe from scratch

A describe.py program that describes the dataset and behaves like the pandas describe() function. It is forbidden to use any function that does the job for you, such as count, mean, std, min, max, percentile, etc.
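As an illustration of what "from scratch" can mean here, the sketch below computes the statistics for one column with explicit loops. The function name `describe_column` is hypothetical. It assumes the pandas conventions: missing values are ignored, the standard deviation divides by n - 1, and percentiles use linear interpolation. The Index column of the output shown earlier (std 462.02, 25% at 399.75) is consistent with these conventions. Sorting is delegated to the built-in `sorted`, which is an assumption about what the subject allows.

```python
import math


def describe_column(values):
    """Return summary statistics for one numeric column, skipping missing values."""
    # Keep only real numbers: drop None and NaN entries.
    data = [v for v in values if v is not None and not math.isnan(v)]
    count = 0
    total = 0.0
    for v in data:
        count += 1
        total += v
    if count == 0:
        return {"count": 0}
    mean = total / count

    # Sample standard deviation: sum of squared deviations over (n - 1).
    squared = 0.0
    for v in data:
        squared += (v - mean) ** 2
    std = math.sqrt(squared / (count - 1)) if count > 1 else float("nan")

    ordered = sorted(data)

    def percentile(p):
        # Linear interpolation between the two closest ranks.
        rank = (count - 1) * p
        low = math.floor(rank)
        high = math.ceil(rank)
        return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

    return {
        "count": count, "mean": mean, "std": std,
        "min": ordered[0], "25%": percentile(0.25), "50%": percentile(0.50),
        "75%": percentile(0.75), "max": ordered[-1],
    }
```

Calling `describe_column` once per numeric column and printing the resulting dictionaries side by side would give a table shaped like the output above.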

Logistic regression training

Multi-class classifier using one-vs-all logistic regression:

logreg_train.[extension] dataset_train.csv

Gradient descent algorithm to minimize the error.

Generates a file containing the model weights.

Usage:

logreg_predict.[extension] dataset_test.csv [weights]
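The one-vs-all scheme can be sketched in plain Python as follows: one binary logistic classifier is trained per house with batch gradient descent, and prediction picks the house with the highest probability. All function names, the learning rate and the number of epochs are illustrative assumptions, not the values used in this project. Feature scaling is left out; on the real dataset, where features such as Arithmancy have values in the tens of thousands, normalizing the inputs first would likely be needed.

```python
import math


def sigmoid(z):
    # Clamp z so that math.exp cannot overflow on extreme inputs.
    z = max(-500.0, min(500.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def probability(weights, x):
    """Probability given by one binary classifier; weights[0] is the bias."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def train_binary(rows, labels, learning_rate=0.1, epochs=1000):
    """Batch gradient descent on the log-loss of one binary classifier."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = probability(weights, x) - y
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        # Step against the gradient averaged over the whole dataset.
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


def train_one_vs_all(rows, houses, **kwargs):
    """Train one classifier per house: that house against all the others."""
    return {
        house: train_binary(rows, [1.0 if h == house else 0.0 for h in houses],
                            **kwargs)
        for house in sorted(set(houses))
    }


def predict(models, x):
    """Return the house whose classifier is the most confident."""
    return max(models, key=lambda house: probability(models[house], x))
```

The dictionary returned by `train_one_vs_all` holds one weight vector per house, which is the kind of content a weights file would store between training and prediction.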

Prediction

Predict from `./datasets/dataset_test.csv` and generate a prediction file `houses.csv` formatted exactly as follows:

$> cat houses.csv
Index,Hogwarts House
0,Gryffindor
1,Hufflepuff
2,Ravenclaw
[...]
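A file in this format can be produced with the standard `csv` module. The sketch below is one possible way to do it; the function name `write_predictions` is hypothetical, and it assumes that predictions are listed in the same order as the rows of the test dataset, so that the index is simply the position.

```python
import csv


def write_predictions(predictions, path="houses.csv"):
    """Write one 'Index,Hogwarts House' row per predicted student."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["Index", "Hogwarts House"])
        for index, house in enumerate(predictions):
            writer.writerow([index, house])
```

For example, `write_predictions(["Gryffindor", "Hufflepuff", "Ravenclaw"])` writes the header followed by the three rows shown above.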

Bonus

• Add more fields for describe.py
• Implement a stochastic gradient descent
• Implement other optimization algorithms (Batch GD/mini-batch GD/ you name it)
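The stochastic and mini-batch variants differ from batch gradient descent mainly in how many samples feed each weight update. The sketch below shows only that batching step; the name `iterate_minibatches` and its parameters are assumptions for illustration. With `batch_size=1` it yields the single-sample batches of stochastic gradient descent, and with `batch_size=len(rows)` it yields one batch per pass, as in batch gradient descent.

```python
import random


def iterate_minibatches(rows, labels, batch_size, rng=None):
    """Yield shuffled (rows, labels) mini-batches that cover the data once."""
    rng = rng or random.Random()
    order = list(range(len(rows)))
    # Shuffle the indices so that each pass sees the samples in a new order.
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        yield [rows[i] for i in chunk], [labels[i] for i in chunk]
```

A training loop would compute the gradient on each yielded batch and update the weights immediately, instead of waiting for a full pass over the dataset.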

Peer-Evaluation

Answers will be evaluated using the accuracy score of the Scikit-Learn library. Professor McGonagall agrees that your algorithm is comparable to the Sorting Hat only if it has a minimum precision of 98%.
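With its default settings, the accuracy score in Scikit-Learn (`sklearn.metrics.accuracy_score`) is the fraction of predictions that exactly match the true labels. A dependency-free equivalent, useful for checking a model before the evaluation, could look like this; the function name `accuracy` is hypothetical.

```python
def accuracy(truth, predicted):
    """Fraction of predictions equal to the true labels."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(1 for t, p in zip(truth, predicted) if t == p)
    return correct / len(truth)
```

Since the true houses of the test dataset are not known in advance, one way to estimate this score is to hold out part of the training dataset and compare the predictions against its known houses.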

Some helpful links

Jupyter

Jupyter notebooks were used for dataset exploration.

Directory Structure

The project directory structure follows the guidelines below.

The Hitchhiker's Guide to Python - Structuring Your Project

CookieCutter utility

How To Structure a Data Science Project: A Step-by-Step Guide

Virtual environment

A Python virtual environment is installed and set up so that this project is self-contained, isolated from the system Python and from other projects' virtual environments. The virtual environment has its own Python interpreter and its own dependencies, the third-party libraries installed according to the requirement.txt file. This avoids polluting the system, prevents dependency conflicts and improves reproducibility for a data science project. We used the virtualenv tool for dependency management and project isolation. Instead of a bash script, we chose a Makefile for generic management tasks, for its capabilities and readability.

Makefile and entrypoint

Makefile: the secret weapon for ML project management

Makefile - Make for Data Science

setup.py script (french)

setup.py

A setup.py file is a standard way in Python to specify how your project should be installed, packaged, and distributed. This file is used by tools like setuptools and pip to manage the installation process. The setup() function within setup.py defines metadata about your project, such as its name, version, dependencies, and other details. Run python setup.py install to install your project locally.

setuptools

Having a setup.py becomes especially important if you plan to distribute your code as a Python package, whether through the Python Package Index (PyPI) or other distribution channels. It helps other developers easily install and use your project and allows tools like pip to manage dependencies.

Tests

Test runner: unittest, included in the Python standard library.

./dslr/tests/testDescribe.py compares DescriberClass and pandas.describe()

./dslr/tests/testUtilsMath.py compares utils.math.py functions and numpy / pandas equivalent functions
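The comparison pattern used by these tests can be sketched with `unittest` as follows. To keep the example free of third-party dependencies, the reference here is the standard library `statistics` module rather than numpy or pandas, which is a substitution made for this sketch. The `mean` function and the `TestMean` class are hypothetical stand-ins for the project's own code.

```python
import statistics
import unittest


def mean(values):
    """Home-made mean, standing in for a function under test."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count


class TestMean(unittest.TestCase):
    def test_matches_reference(self):
        data = [1.5, -2.0, 7.25, 0.0]
        # Compare with the reference implementation up to rounding error.
        self.assertAlmostEqual(mean(data), statistics.mean(data))

    def test_empty_input_raises(self):
        with self.assertRaises(ZeroDivisionError):
            mean([])
```

Saved in a test file, such a class can be run with `python -m unittest`.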

pyinstaller module

An executable application (.exe) can be built with pyinstaller.

shameleon/dslr-42
