👩‍🦯 Body Level Classification 🏃‍♂️

The project comprises tackling the supervised problem of body level classification: given numerical and categorical features regarding an individual’s health such as their height, weight, family history, age, and eleven others, the objective is to predict the body level of the individual (out of four possible levels).

🚀 Pipeline

Our solution to said problem utilizes the pipeline depicted below.

(figure: solution pipeline)

📂 Folder Structure

The following is the project's folder structure:

```
.
├── DataFiles
│   ├── dataset.csv
│   ├── train.csv
│   └── val.csv
├── DataPreparation
│   ├── CovarianceAnalysis.py
│   ├── DataPreparation.ipynb
│   └── DataPreparation.py
├── HandleClassImbalance
│   ├── HandleClassImbalance.ipynb
│   └── HandleClassImbalance.py
├── ModelBaselines
│   └── Baseline.ipynb
├── Model Pipelines
│   ├── AdaBoost
│   │   └── Adaboost.ipynb
│   ├── Bagging
│   │   ├── Analysis.ipynb
│   │   └── SVMBagging.ipynb
│   ├── LogisticRegression
│   │   ├── Analysis.ipynb
│   │   └── LogisticRegression.ipynb
│   ├── Perceptron
│   │   ├── Analysis.ipynb
│   │   └── Perceptron.ipynb
│   ├── RandomForest
│   │   ├── Analysis.ipynb
│   │   └── RandomForest.ipynb
│   ├── SVM
│   │   ├── Analysis.ipynb
│   │   └── SVM.ipynb
│   ├── StackingEnsemble
│   │   └── StackingEnsemble.ipynb
│   ├── VotingEnsemble
│   │   └── VotingEnsemble.ipynb
│   ├── ModelAnalysis.py
│   └── ModelVisualization.py
├── ModelScoring
│   └── Pipeline.py
├── References
│   └── ML Project Document.pdf
├── Saved
├── Quests
├── README.md
└── utils.py
```

🚁 Running the Project

```bash
pip install -r requirements.txt
# To run any stage of the pipeline, consider the stage's folder. There will always be a demonstration notebook.
```

📝 Executive Summary

We started by designing and running a dataset analysis pipeline (i.e., studying the target function), which led us to initiate SVM, Logistic Regression, Gaussian Naive Bayes, Random Forest, Perceptron, and AdaBoost models. We then designed a model analysis cycle and applied it to each of these models, with the objective of studying each model's performance and tuning its hyperparameters, feature set, and data preparation choices.

The best weighted F1 (WF1) results under 10-repeated-10-fold cross-validation are as follows (they improve further under other cross-validation schemes):

| SVM | Logistic Regression | Random Forest |
| --- | --- | --- |
| 98.65% | 98.38% | 97.63% |
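
As a rough illustration, the following is a minimal sketch of how such scores can be computed with scikit-learn; the synthetic data and default SVM settings are placeholders, not our tuned configuration.

```python
# Sketch: weighted F1 under 10-repeated-10-fold cross-validation (scikit-learn).
# The synthetic dataset stands in for the prepared features/labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1180, n_features=16, n_informative=8, n_classes=4)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(SVC(), X, y, scoring="f1_weighted", cv=cv)
print(f"WF1: {scores.mean():.4f} ± {scores.std():.4f}")
```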

The following visually depicts the SVM's output over the most important features:

(figure: SVM output over the most important features)

In the end, we did not choose a single model; instead, we formed ensembles of these models via Voting, Bagging, and Stacking, yielding the following:

| Bagging SVM | Voting | Stacked Generalization |
| --- | --- | --- |
| 98.37% | 96.5% | 99.12% |
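
For reference, a minimal sketch of the three ensemble types with scikit-learn; the base estimators and their parameters are illustrative, not our tuned choices.

```python
# Sketch: the three ensemble strategies (scikit-learn >= 1.2; `estimator` was
# named `base_estimator` in earlier versions). Base models are illustrative.
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

bagging = BaggingClassifier(estimator=SVC(), n_estimators=10)
voting = VotingClassifier(estimators=[
    ("svm", SVC(probability=True)),           # soft voting needs predict_proba
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier()),
], voting="soft")
stacking = StackingClassifier(estimators=[
    ("svm", SVC()),
    ("rf", RandomForestClassifier()),
], final_estimator=LogisticRegression(max_iter=1000))
```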

Accordingly, our final model was the stacking ensemble. It yields 99% on the competition's test set; as for why it did not place higher on the leaderboard, check the autopsy report.

We shall illustrate the whole pipeline, including the analysis stages, in the rest of this README. For an extensive overview of the insights extracted and analysis results for the rest of the models, please check the report or the demonstration notebooks herein.

🌊 Data Preparation

Data preparation involves reading the data and putting it in a suitable form. Options employed in this stage, beyond reading the data, are:

  • To read specific splits of the data (by default train)
  • To read only columns of numerical or categorical types (or both)
  • Label, one-hot, or frequency encoding for categorical features
  • To standardize the data

This module was used to ingest the data for all subsequent models and analysis.
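
As a rough sketch of the options above (the function name, signature, and defaults are illustrative, not the module's actual API):

```python
# Illustrative sketch of the data preparation options; not the module's real API.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def read_prepared(split="train", kinds=("numerical", "categorical"),
                  encode="label", standardize=False):
    df = pd.read_csv(f"DataFiles/{split}.csv")
    num = df.select_dtypes(include="number")
    cat = df.select_dtypes(exclude="number")
    if "numerical" not in kinds:
        num = num.iloc[:, :0]
    if "categorical" not in kinds:
        cat = cat.iloc[:, :0]
    if encode == "one-hot":
        cat = pd.get_dummies(cat)
    elif encode == "label":
        cat = cat.apply(lambda c: c.astype("category").cat.codes)
    elif encode == "frequency":
        cat = cat.apply(lambda c: c.map(c.value_counts(normalize=True)))
    if standardize and len(num.columns):
        num = pd.DataFrame(StandardScaler().fit_transform(num), columns=num.columns)
    return pd.concat([num, cat], axis=1)
```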

🎨 Dataset Analysis

To guide model initiation by studying the population and the target function, we performed the following analyses:

◉ Basic Counts & Variables Involved

| Number of Samples | Number of Features | Number of Classes |
| --- | --- | --- |
| 1180 | 16 | 4 |

| Variable | #Uniques |
| --- | --- |
| Gender | 2 |
| H_Cal_Consump | 2 |
| Smoking | 2 |
| Fam_Hist | 2 |
| H_Cal_Burn | 2 |
| Alcohol_Consump | 3 |
| Food_Between_Meals | 4 |
| Transport | 5 |
| Age | Numerical |
| Height | Numerical |
| Weight | Numerical |
| Veg_Consump | Numerical |
| Water_Consump | Numerical |
| Meal_Count | Numerical |
| Phys_Act | Numerical |
| Time_E_Dev | Numerical |

◉ Variable Distributions

In this, we analyzed the distribution of each variable via a kernel density estimate or a bar chart, depending on whether it is numerical or categorical, respectively.

(figure: variable distributions)
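
A minimal sketch of how such plots can be produced, assuming seaborn and a dataframe `df` from the data preparation stage:

```python
# Sketch: per-variable distribution plots (seaborn/matplotlib assumed).
import seaborn as sns
import matplotlib.pyplot as plt

for col in df.columns:
    if df[col].dtype.kind in "if":   # numerical -> kernel density estimate
        sns.kdeplot(df[col])
    else:                            # categorical -> bar chart of counts
        sns.countplot(x=df[col])
    plt.show()
```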

◉ Prior Distribution

For purposes of studying imbalance between classes.

(figure: class prior distribution)

◉ Variable Correlations

We analyzed correlations among all pairs of numerical variables using Pearson's correlation, all pairs of categorical variables using Cramér's V, and all pairs of numerical and categorical variables using the correlation ratio.

(figure: variable correlations)
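
For concreteness, here are minimal implementations of the two non-Pearson measures under their standard definitions (Pearson's r comes directly from `DataFrame.corr()`); the sketches assume pandas Series inputs:

```python
# Sketch: association measures for categorical pairs and mixed pairs.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Cramér's V between two categorical series, from the chi-squared statistic.
    chi2 = chi2_contingency(pd.crosstab(x, y))[0]
    n, k = len(x), min(x.nunique(), y.nunique())
    return np.sqrt(chi2 / (n * (k - 1)))

def correlation_ratio(categories, values):
    # Correlation ratio (eta): between-group variance over total variance.
    ss_between = sum(len(g) * (g.mean() - values.mean()) ** 2
                     for _, g in values.groupby(categories))
    ss_total = ((values - values.mean()) ** 2).sum()
    return np.sqrt(ss_between / ss_total)
```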

◉ Separability with Numerical Variables

Here, the separability of the target is analyzed under all possible pairs of numerical variables.

(figure: target separability under pairs of numerical variables)

◉ Separability with Categorical Variables

In this, we study the separability of the target under different categories of each categorical variable.

(figure: target separability under categorical variables)

◉ Theoretical Generalization Guarantees

We automated a generalization check: given any two of validation-set size, maximum allowed error, and probability of violating that error, it computes the third. This was used to inform the number of splits for cross-validation (tuning each model) and the size of the validation set (choosing between models).

Hoeffding's Inequality states: $$P[|E_{out}(g)-E_{test}(g)| \leq \epsilon] \geq 1-2e^{-2N_{test}\epsilon^2}$$ If we use a validation set of size $0.2N_{train}=295$, then with $\epsilon=0.06$ we have $$P[|E_{out}(g)-E_{test}(g)| \leq 0.06] \geq 0.761$$ In other words, with probability at least $0.761$, the generalization error of our model will be at most $0.06$ given a validation set of size $295$.
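
A minimal sketch of such a check, solving Hoeffding's bound $\delta = 2e^{-2N\epsilon^2}$ for whichever quantity is missing:

```python
# Sketch: given any two of (N, eps, delta), solve Hoeffding's bound for the third.
import math

def hoeffding(N=None, eps=None, delta=None):
    if delta is None:
        return 2 * math.exp(-2 * N * eps**2)              # violation probability
    if eps is None:
        return math.sqrt(math.log(2 / delta) / (2 * N))   # error tolerance
    if N is None:
        return math.log(2 / delta) / (2 * eps**2)         # required set size

print(1 - hoeffding(N=295, eps=0.06))  # ~0.761, matching the figure above
```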

🤖 Model Initiation

We considered two trivial baselines (MostFrequent and UniformRandom) and one nontrivial baseline (Gaussian Naive Bayes) to set the bar for the bias of any further models we consider. We then initiated and analyzed the following models, sketched in code after the list:

  • SupportVectorMachines
  • LogisticRegression
  • Perceptron
  • RandomForest
  • AdaptiveBoosting
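
A minimal sketch of this lineup with scikit-learn defaults (tuning came later in the analysis cycle):

```python
# Sketch: baselines and initiated models (scikit-learn, default hyperparameters).
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

baselines = {
    "MostFrequent": DummyClassifier(strategy="most_frequent"),
    "UniformRandom": DummyClassifier(strategy="uniform"),
    "GaussianNB": GaussianNB(),
}
models = {
    "SVM": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "Perceptron": Perceptron(),
    "RandomForest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}
```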

🛸 Model Analysis

We designed a unified analysis cycle that applies to any of the models, as demonstrated in the report and the notebooks. It consists of the following stages, in no particular order:

| Analysis Stage | Components |
| --- | --- |
| Model Greetings | Initiating Model and Viewing Hyperparameters |
| | Studying the Hyperparameters and their Importance (documentation) |
| Basic Model Analysis | Testing Model Assumptions (if any) |
| | VC Dimension Check for Generalization |
| | Bias-Variance Analysis |
| | Learning Curve |
| Hyperparameter Analysis | Validation Curves |
| | Hyperparameter Search |
| | Hyperparameter Logging |
| Feature Analysis | Feature Importance |
| | Recursive Feature Elimination |
| Class Imbalance Analysis | Analyzing Different Methods |
| | Analyzing Different Hyperparameters |

We will demonstrate this cycle for Logistic Regression; for the extracted insights and the other models, see the report or the notebooks.

🤝 Model Greetings

| Hyperparameter | Value |
| --- | --- |
| C | 40.074 |
| class_weight | balanced |
| dual | False |
| fit_intercept | True |
| intercept_scaling | 1 |
| l1_ratio | None |
| max_iter | 100 |
| multi_class | multinomial |
| n_jobs | None |
| penalty | l2 |
| random_state | None |
| solver | newton-cg |
| tol | 0.0001 |
| verbose | 0 |
| warm_start | False |

The purpose of this stage is to get familiar with the model and its hyperparameters which involved research or reading the documentation.
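
In code, this stage amounts to something like the following sketch (scikit-learn):

```python
# Sketch: instantiate the model with the hyperparameters tabulated above,
# then view the full hyperparameter set.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=40.074, class_weight="balanced",
                           multi_class="multinomial", penalty="l2",
                           solver="newton-cg", tol=1e-4)
print(model.get_params())  # corresponds to the table above
```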

💡 Basic Model Analysis

◉ Testing Log-Linearity Assumption

(figure: log-linearity check)

◉ VC Dimension Analysis

By estimating the VC dimension of the model, we have $d_{vc}=37$. Since $N=1477$, it holds that $$N \geq 10d_{vc}$$ Hence, the model is expected to have no issues with generalization.

In this, the number of parameters of the model was used to estimate its generalization ability using the VC-bound rule of thumb.
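
Continuing the earlier sketches, one way to approximate this check is to count the fitted model's parameters; note that the report's $d_{vc}=37$ may be derived differently.

```python
# Sketch: approximate d_vc by the number of learned parameters, then apply
# the N >= 10 * d_vc rule of thumb. Assumes `model` was fit on (X, y) above.
model.fit(X, y)
d_vc = model.coef_.size + model.intercept_.size
print(f"d_vc ~ {d_vc}, N = {len(X)}, rule holds: {len(X) >= 10 * d_vc}")
```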

◉ Bias-Variance Analysis

| Train WF1 | Val WF1 | Avoidable Bias | Variance |
| --- | --- | --- | --- |
| 0.986 | 0.981 | 0.014 | 0.005 |

The bias and variance of the model were heuristically computed here (in the style of Andrew Ng).
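
The heuristic itself is simple arithmetic over the scores in the table above, assuming an optimal WF1 of 1.0 as a proxy for the best achievable performance:

```python
# Sketch of the heuristic, using the WF1 scores from the table above.
best_achievable = 1.0                          # assumed optimal WF1
train_wf1, val_wf1 = 0.986, 0.981
avoidable_bias = best_achievable - train_wf1   # 0.014
variance = train_wf1 - val_wf1                 # 0.005
```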

◉ Learning Curve

(figure: learning curve)

This helps indicate the bias of the model and sheds light on whether it would benefit from adding more data.
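
Continuing the earlier sketches (reusing `model`, `X`, `y`), such a curve can be computed with scikit-learn; the cv and scoring choices shown are illustrative:

```python
# Sketch: learning curve over increasing training-set fractions (scikit-learn).
import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=10, scoring="f1_weighted")
```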

🔎 Hyperparameter Analysis

◉ Validation Curves

(figure: validation curves)

These study the effect of specific hyperparameters on the model's performance (in-sample and out-of-sample error) and mark the point where the model starts to overfit.
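
A sketch of one such curve over the regularization strength `C` (the parameter range is illustrative), again reusing `model`, `X`, `y`:

```python
# Sketch: validation curve over a single hyperparameter (scikit-learn).
import numpy as np
from sklearn.model_selection import validation_curve

train_scores, val_scores = validation_curve(
    model, X, y, param_name="C", param_range=np.logspace(-3, 3, 13),
    cv=10, scoring="f1_weighted")
```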

◉ Random Hyperparameter Search

| C | class_weight | multi_class | penalty | solver | WF1 |
| --- | --- | --- | --- | --- | --- |
| 40.074 | balanced | multinomial | l2 | newton-cg | 0.98104 |

Here we used random search to find a well-performing set of hyperparameters.
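
A minimal sketch of such a search with scikit-learn; the search space below is illustrative, not the one we actually used:

```python
# Sketch: random hyperparameter search for logistic regression (scikit-learn).
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={
        "C": loguniform(1e-3, 1e3),
        "class_weight": [None, "balanced"],
        "solver": ["newton-cg", "lbfgs"],
    },
    n_iter=50, scoring="f1_weighted", cv=10)
search.fit(X, y)   # X, y: the prepared features/labels
print(search.best_params_, search.best_score_)
```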

◉ Hyperparameter Logging

Columns are grouped under info, read_data, LogisticRegression, and metrics:

| time | date | duration | id | split | kind | standardize | selected | encode | class_weight | multi_class | penalty | solver | dual | tol | fit_intercept | intercept_scaling | max_iter | verbose | warm_start | train_wf1 | val_wf1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16:00:51 | 05/14/23 | 49.76 s | 3 | train | Numerical | True | True | | balanced | multinomial | l2 | newton-cg | False | 0.0 | True | 1 | 100 | 0 | False | 0.985 | 0.9814 |
| 16:19:45 | 05/14/23 | 7.69 s | 4 | all | Numerical | True | | | balanced | multinomial | l2 | newton-cg | False | 0.0 | True | 1 | 100 | 0 | False | 0.991 | 0.9831 |
| 01:48:36 | 05/15/23 | 17.70 s | 6 | all | Numerical | True | | | balanced | multinomial | l2 | newton-cg | False | 0.0 | True | 1 | 100 | 0 | False | 0.986 | 0.9838 |

We used in-notebook experiment logging via the MLPath library. Shown above is a sample of the log table.

🚦 Feature Analysis

◉ Feature Importance Analysis

(figure: feature importances)

In this, we analyzed the importance of each feature as assigned by the model's weights.
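
For a linear model, this can be read off the weight magnitudes; a sketch reusing the fitted `model` from the earlier blocks:

```python
# Sketch: feature importance from the magnitude of the fitted model's weights.
import numpy as np

importance = np.abs(model.coef_).mean(axis=0)  # averaged over the four classes
```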

◉ Recursive Feature Elimination

(figure: recursive feature elimination curve)

Top 3 Features:

| Veg_Consump | Height | Weight |
| --- | --- | --- |
| 0.32722 | 7.841 | 26.777 |

As suggested by one of Vapnik's papers, a decent feature selection strategy is to repeatedly remove the least important feature until a minimum number of features is reached or the metric stops improving.
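
scikit-learn's RFECV implements a close variant of this strategy; a sketch, again reusing `model`, `X`, `y`:

```python
# Sketch: recursive feature elimination with cross-validation (scikit-learn).
# Evaluates each candidate subset size by CV and keeps the best-scoring one.
from sklearn.feature_selection import RFECV

rfe = RFECV(model, min_features_to_select=3, scoring="f1_weighted", cv=10)
rfe.fit(X, y)
print(rfe.n_features_, rfe.support_)  # chosen size and kept-feature mask
```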

⚖️ Class Imbalance Analysis

◉ Analyzing Different Methods

(figure: comparison of class-imbalance methods)

The purpose of this was to compare different resampling approaches and class-weighting.
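
The two families can be sketched as follows; we assume the imbalanced-learn library for the resampling side:

```python
# Sketch: resampling (imbalanced-learn) versus class weighting (scikit-learn).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

X_over, y_over = SMOTE().fit_resample(X, y)                 # synthetic oversampling
X_under, y_under = RandomUnderSampler().fit_resample(X, y)  # random undersampling
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)  # reweighting
```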

◉ Analyzing Different Hyperparameters

(figure: resampling hyperparameter analysis)

Different hyperparameters within specific resampling approaches were analyzed here.

🏁 Model Evaluation, Ensemble and Final Delivery

As illustrated above.

📜 Conventions

We set a number of working standards before undertaking the project. If you wish to contribute for any reason, please respect these standards.

👥 Collaborators

  • EssamWisam (Essam)
  • mariemzayn18 (Mariem Muhammed)
  • Marim1611 (Marim Naser)
  • Muhammad-saad-2000 (MUHAMMAD SAAD)

📈 Progress Tracking

We have utilized Notion for progress tracking and task assignment among the team.

💖 Thank you. 💖