- This project is part of the "Machine Learning" course at Université Paris Dauphine, PSL.
- Problem 1: Implementation of Logistic Regression and LDA
- Problem 2: Working with real-world data to evaluate the implemented models
- Requirements:
  - `Python` version 3.6 or higher.
- Run `Analysis.R` to get the results of the statistical analysis.
- Install all required packages by running `python3 setup.py install`, or, if you have `pip` installed, simply run `pip install -r requirements.txt`.
- Run `make_data_beautiful.py` to start the data preprocessing step: `python3 src/make_data_beautiful.py`
- Run the notebooks `./Notebooks/{LogisticRegression, LDA, NN}.ipynb` to get the results of the implemented models (Logistic Regression and LDA).
- Run the notebook `./Notebooks/comparator.ipynb` to compare the implemented models with the `sklearn` models (the reported metrics are sketched after Table 1 below).
- Run the notebook `./Notebooks/explain_model.ipynb` to get the results of explaining the implemented models.
Table 1: Results of the implemented models (Logistic Regression and LDA) and `sklearn` models
Model | Accuracy | Precision | Recall | F1-score | ROC AUC |
---|---|---|---|---|---|
Logistic Regression | 0.981481 | 0.972973 | 0.972973 | 0.972973 | 0.979444 |
LDA | 0.972222 | 1.0 | 0.918919 | 0.957746 | 0.959459 |
Neural Network | 0.990741 | 0.973684 | 1.0 | 0.986667 | 0.992958 |
Linear SVM | 0.981481 | 0.972973 | 0.972973 | 0.972973 | 0.979444 |
Ridge | 0.953704 | 1.0 | 0.864865 | 0.927536 | 0.932432 |
XGBoost | 0.962963 | 0.945946 | 0.945946 | 0.945946 | 0.958888 |
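The scores in Table 1 are standard binary-classification metrics. As a rough, hedged illustration of how such numbers can be produced with `scikit-learn` (this is not the project's evaluation code from `comparator.py`; `y_test`, `y_pred`, and `y_score` are placeholder names):

```python
# Hypothetical sketch of how the Table 1 metrics could be computed with scikit-learn.
# `y_test`, `y_pred`, and `y_score` are placeholder names, not objects from this repo.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_test, y_pred, y_score):
    """Return the five metrics reported in Table 1 for one model."""
    return {
        "Accuracy":  accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall":    recall_score(y_test, y_pred),
        "F1-score":  f1_score(y_test, y_pred),
        # ROC AUC uses the predicted probability (or decision score) of the positive class.
        "ROC AUC":   roc_auc_score(y_test, y_score),
    }
```

Applying such a function to each model's test-set predictions yields a table like the one above.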
- The results of the statistical analysis are in the directory `plots/`
- The results of the implemented models are in the directory `src/output_plots/`
- Models are saved in the directory `src/output_models/` (see the loading sketch after the project structure below)
- HTML files for investigating missed predictions of logistic regression are in `src/logistic_missed_predict_investigate/`
- `src/`: source code
  - `data/`: data files
  - `output_plots/`: output plots
  - `make_data_beautiful.py`: data preprocessing
  - `main.py`: implementation of Logistic Regression and LDA
  - `comparator.py`: comparing the implemented models with `sklearn` models
  - `logistic_missed_predict_investigate/`: investigating missed predictions of logistic regression
  - ...
- `AREA51/`: test and debug code
- `dataset/`: data files
- `Notebooks/`: notebooks
- `plots/`: analysis plots
- `Analysis.R`: R script for the statistical analysis
- `README.md`: this file
- `requirements.txt`: list of necessary packages
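The serialization format of the files in `src/output_models/` is not documented in this README. Assuming they are pickled Python objects (an assumption, not a guarantee), a saved model could be reloaded roughly as follows; the filename below is purely illustrative:

```python
# Hypothetical sketch: reload a serialized model from src/output_models/.
# Assumes pickle serialization; "logistic_regression.pkl" is an illustrative name only.
import pickle
from pathlib import Path

model_path = Path("src/output_models") / "logistic_regression.pkl"
with model_path.open("rb") as f:
    model = pickle.load(f)

# Any model exposing a predict() method can then be applied to a preprocessed feature matrix X:
# y_pred = model.predict(X)
```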
Dataset: Breast Cancer Wisconsin (Diagnostic)
- Problem: Predict whether the cancer is benign or malignant
- Data description:
- 569 samples
- 30 features
- 2 classes: benign (357 samples) and malignant (212 samples)
- 1 target: `diagnosis` (B: benign, M: malignant)
- Features description:
  - `id`: ID number
  - `diagnosis`: diagnosis of breast tissues (B: benign, M: malignant)
  - `radius_mean`: mean of distances from center to points on the perimeter
  - `texture_mean`: standard deviation of gray-scale values
  - `perimeter_mean`: mean size of the core tumor
  - `area_mean`: mean area of the core tumor
  - `smoothness_mean`: mean of local variation in radius lengths
  - `compactness_mean`: mean of perimeter^2 / area - 1.0
  - `concavity_mean`: mean severity of concave portions of the contour
  - `concave points_mean`: mean number of concave portions of the contour
  - `symmetry_mean`: mean symmetry of the cell nuclei
  - `fractal_dimension_mean`: mean "coastline approximation" - 1
  - `radius_se`: standard error for the mean of distances from center to points on the perimeter
  - `texture_se`: standard error for the standard deviation of gray-scale values
  - `perimeter_se`: standard error for the size of the core tumor
  - `area_se`: standard error for the area of the core tumor
  - `smoothness_se`: standard error for the local variation in radius lengths
  - `compactness_se`: standard error for perimeter^2 / area - 1.0
  - `concavity_se`: standard error for the severity of concave portions of the contour
  - `concave points_se`: standard error for the number of concave portions of the contour
  - `symmetry_se`: standard error for the symmetry
  - `fractal_dimension_se`: standard error for "coastline approximation" - 1
  - `radius_worst`: "worst" or largest mean value for the mean of distances from center to points on the perimeter
  - `texture_worst`: "worst" or largest mean value for the standard deviation of gray-scale values
  - `perimeter_worst`: "worst" or largest mean value for the size of the core tumor
  - `area_worst`: "worst" or largest mean value for the area of the core tumor
  - `smoothness_worst`: "worst" or largest mean value for the local variation in radius lengths
  - `compactness_worst`: "worst" or largest mean value for perimeter^2 / area - 1.0
  - `concavity_worst`: "worst" or largest mean value for the severity of concave portions of the contour
  - `concave points_worst`: "worst" or largest mean value for the number of concave portions of the contour
  - `symmetry_worst`: "worst" or largest mean value for the symmetry
  - `fractal_dimension_worst`: "worst" or largest mean value for "coastline approximation" - 1
- Target description: `diagnosis`: diagnosis of breast tissues (B: benign, M: malignant)
- Note: the `mean`, `se` (standard error), and `worst` (mean of the three largest values) are computed for each of the 10 base measurements per image, resulting in the 30 features above.
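For orientation, here is a minimal, hedged sketch of loading the dataset with `pandas` and encoding the target. The path `dataset/data.csv` and the extra `Unnamed: 32` column follow the common Kaggle release of this dataset and may not match the files shipped in this repository:

```python
# Hypothetical sketch of loading the Breast Cancer Wisconsin (Diagnostic) data.
# The path "dataset/data.csv" and exact column layout are assumptions based on
# the common Kaggle release, not guarantees about this repository.
import pandas as pd

df = pd.read_csv("dataset/data.csv")

# Encode the target: malignant (M) -> 1, benign (B) -> 0.
y = df["diagnosis"].map({"M": 1, "B": 0})

# Keep only the 30 numeric measurements as features.
X = df.drop(columns=["id", "diagnosis", "Unnamed: 32"], errors="ignore")

print(X.shape)           # expected: (569, 30)
print(y.value_counts())  # expected: 357 benign (0), 212 malignant (1)
```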
- Implement `LDA` (a minimal illustrative sketch follows this list)
- Implement `Logistic Regression` (a minimal illustrative sketch follows this list)
- Understand the data and comprehend the problem
- Data analysis with visualization in `R` and `Python`
- Implement the statistical analysis for the data transformation
- Outlier detection and investigation
- Implement the data transformation
- Implement model evaluation, metrics, and hyperparameter tuning
- Test the `LDA` and `Logistic Regression` models with post-processed data
- Test the `LDA` and `Logistic Regression` models with pre-processed data
- Tune the hyperparameters of `Logistic Regression`
- Misclassified data analysis
- Evaluate the models and implement `SVM`, `Gaussian Naive Bayes`, `XGBoost`, and `CatBoost`
- Implement the ensemble model and compare the results
- Interpret the results
- Write the report
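For readers who want the gist of Problem 1 without opening `src/main.py`, here is a minimal, generic sketch of binary logistic regression trained by batch gradient descent. It illustrates the technique only and is not the implementation used in this project:

```python
# Minimal sketch of binary logistic regression trained by batch gradient descent.
# Generic illustration only; not the implementation from src/main.py.
import numpy as np

class SimpleLogisticRegression:
    def __init__(self, lr=0.1, n_iter=2000):
        self.lr = lr          # learning rate
        self.n_iter = n_iter  # number of gradient steps

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        for _ in range(self.n_iter):
            p = self._sigmoid(X @ self.w + self.b)  # predicted P(y=1 | x)
            grad_w = X.T @ (p - y) / n              # gradient of the average log-loss
            grad_b = np.mean(p - y)
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
        return self

    def predict_proba(self, X):
        return self._sigmoid(np.asarray(X, dtype=float) @ self.w + self.b)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
```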
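Likewise, a compact, generic sketch of two-class LDA with a pooled covariance matrix, again only an illustration of the technique rather than this repository's implementation:

```python
# Minimal sketch of two-class Linear Discriminant Analysis (shared covariance).
# Generic illustration only; not the implementation from src/main.py.
import numpy as np

class SimpleLDA:
    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n, d = X.shape
        self.means_, self.priors_ = {}, {}
        pooled = np.zeros((d, d))
        for c in self.classes_:
            Xc = X[y == c]
            self.means_[c] = Xc.mean(axis=0)
            self.priors_[c] = Xc.shape[0] / n
            pooled += (Xc - self.means_[c]).T @ (Xc - self.means_[c])
        # Pooled within-class covariance, shared by both classes.
        self.cov_ = pooled / (n - len(self.classes_))
        self.cov_inv_ = np.linalg.pinv(self.cov_)
        return self

    def _discriminant(self, X, c):
        # Linear discriminant: x' S^-1 mu_c - 0.5 mu_c' S^-1 mu_c + log prior_c
        mu = self.means_[c]
        return (X @ self.cov_inv_ @ mu
                - 0.5 * mu @ self.cov_inv_ @ mu
                + np.log(self.priors_[c]))

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = np.column_stack([self._discriminant(X, c) for c in self.classes_])
        return self.classes_[np.argmax(scores, axis=1)]
```

Both sketches assume a numeric feature matrix `X` of shape (n_samples, n_features) and binary labels `y`.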
- Authors of the Breast Cancer Wisconsin (Diagnostic) Data Set:
- Dr. William H. Wolberg, General Surgery Dept., University of Wisconsin, Clinical Sciences Center, Madison, WI 53792
- W. Nick Street, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
- Olvi L. Mangasarian, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
- Logistic Regression lecture notes by Tom Mitchell, Carnegie Mellon University
- Matrix Cookbook by H. Wolkowicz
- Logistic Regression by Andrew Ng, Stanford University
- LDA detailed explanation by Tarek Elgindy, University of Salford
- Machine Learning in Action by Peter Harrington
- Machine Learning: A Probabilistic Perspective by Kevin P. Murphy