BreastCancerDiagnostic

This project is part of the "Machine Learning" course at University Paris Dauphine, PSL.

Project: Machine Learning

Author:

Description:

Content:

Execution:

  • Requirements: Python version 3.6 or higher.
  • Run Analysis.R to get the results of the statistical analysis
  • Install all required packages by running:
    python3 setup.py install
    
    or you can just run the following command if you have pip installed:
    pip install -r requirements.txt
    
  • Run make_data_beautiful.py to preprocess the data:
    python3 src/make_data_beautiful.py
    
  • Run the notebooks to get the results of the implemented models (Logistic Regression, LDA, and the neural network in NN.ipynb):
    ./Notebooks/{LogisticRegression, LDA, NN}.ipynb
    
  • Run comparator.ipynb to compare the implemented models with the sklearn models:
      ./Notebooks/comparator.ipynb
    
  • Run explain_model.ipynb to get explanations of the implemented models:
      ./Notebooks/explain_model.ipynb
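
As a rough illustration of what a from-scratch model of the kind these notebooks exercise involves, here is a minimal logistic regression trained by batch gradient descent. This is a sketch, not the code in src/main.py: the function names `fit_logistic` and `predict`, the learning rate, the epoch count, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss.

    X is a list of feature lists, y a list of 0/1 labels.
    lr and epochs are illustrative defaults, not tuned values.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log-loss w.r.t. the logit is (p - y).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Return hard 0/1 predictions (1 would stand for malignant here)."""
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
        for xi in X
    ]


# Toy, linearly separable data with a single made-up feature.
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [0, 0, 1, 1]
w_toy, b_toy = fit_logistic(X_toy, y_toy)
print(predict(X_toy, w_toy, b_toy))
```

On real data the features would normally be standardized first, since plain gradient descent converges slowly when columns such as area and smoothness differ by orders of magnitude.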
    

Results:

Table 1: Results of implemented models (Logistic Regression and LDA) and sklearn models

| Model               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
|---------------------|----------|-----------|----------|----------|----------|
| Logistic Regression | 0.981481 | 0.972973  | 0.972973 | 0.972973 | 0.979444 |
| LDA                 | 0.972222 | 1.0       | 0.918919 | 0.957746 | 0.959459 |
| Neural Network      | 0.990741 | 0.973684  | 1.0      | 0.986667 | 0.992958 |
| Linear SVM          | 0.981481 | 0.972973  | 0.972973 | 0.972973 | 0.979444 |
| Ridge               | 0.953704 | 1.0       | 0.864865 | 0.927536 | 0.932432 |
| XGBoost             | 0.962963 | 0.945946  | 0.945946 | 0.945946 | 0.958888 |

  • The results of statistical analysis are in directory plots/
  • The results of implemented models are in directory src/output_plots/
  • Models are saved in directory src/output_models/
  • HTML files for investigating the missed predictions of logistic regression are in directory src/logistic_missed_predict_investigate/
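
The columns of Table 1 are standard binary-classification metrics, and the sketch below shows how they follow from confusion-matrix counts. The counts are an assumption: they are one set of values (on a hypothetical 108-sample test split, with malignant as the positive class) that reproduces the Logistic Regression row, not figures stated in this README. The ROC AUC line assumes the score was computed from hard 0/1 predictions, in which case it reduces to the mean of recall and specificity. The function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the Table 1 metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    # With hard labels the ROC curve has a single interior point,
    # so its area is the mean of recall and specificity.
    roc_auc_hard = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc_hard,
    }


# Assumed counts: 36 true positives, 1 false positive,
# 1 false negative, 70 true negatives.
for name, value in binary_metrics(tp=36, fp=1, fn=1, tn=70).items():
    print(f"{name}: {value:.6f}")
```

If the ROC AUC were instead computed from predicted probabilities, it would generally differ from this hard-label value.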

Project structure:

  • src/: source code

    • data/: data files
    • output_plots/: output plots
    • make_data_beautiful.py: preprocessing data
    • main.py: implementation of Logistic Regression and LDA
    • comparator.py: comparing implemented models with sklearn models
    • logistic_missed_predict_investigate/: investigating missed predictions of logistic regression
    • ...
  • AREA51/: test and debug code

  • dataset/: data files

  • Notebooks/: notebooks

  • plots/: analysis plots

  • Analysis.R: R script for statistical analysis

  • README.md: this file

  • requirements.txt: list of necessary packages

Overview dataset and problem:

Dataset: Breast Cancer Wisconsin (Diagnostic)

  • Problem: Predict whether the cancer is benign or malignant
  • Data description:
    • 569 samples
    • 30 features
    • 2 classes: benign (357 samples) and malignant (212 samples)
    • 1 target: diagnosis (B: benign, M: malignant)
  • Features description:
    • id: ID number (identifier, not one of the 30 features)
    • diagnosis: diagnosis of breast tissues (B: benign, M: malignant); this is the target
    • radius_mean: mean radius of the cell nuclei (radius: mean of distances from center to points on the perimeter)
    • texture_mean: mean texture (texture: standard deviation of gray-scale values)
    • perimeter_mean: mean perimeter
    • area_mean: mean area
    • smoothness_mean: mean smoothness (smoothness: local variation in radius lengths)
    • compactness_mean: mean compactness (compactness: perimeter^2 / area - 1.0)
    • concavity_mean: mean concavity (concavity: severity of concave portions of the contour)
    • concave points_mean: mean number of concave portions of the contour
    • symmetry_mean: mean symmetry
    • fractal_dimension_mean: mean fractal dimension (fractal dimension: "coastline approximation" - 1)
    • radius_se: standard error of the radius
    • texture_se: standard error of the texture
    • perimeter_se: standard error of the perimeter
    • area_se: standard error of the area
    • smoothness_se: standard error of the smoothness
    • compactness_se: standard error of the compactness
    • concavity_se: standard error of the concavity
    • concave points_se: standard error of the number of concave portions of the contour
    • symmetry_se: standard error of the symmetry
    • fractal_dimension_se: standard error of the fractal dimension
    • radius_worst: "worst" or largest radius (mean of the three largest values)
    • texture_worst: "worst" or largest texture
    • perimeter_worst: "worst" or largest perimeter
    • area_worst: "worst" or largest area
    • smoothness_worst: "worst" or largest smoothness
    • compactness_worst: "worst" or largest compactness
    • concavity_worst: "worst" or largest concavity
    • concave points_worst: "worst" or largest number of concave portions of the contour
    • symmetry_worst: "worst" or largest symmetry
    • fractal_dimension_worst: "worst" or largest fractal dimension
  • Target description:
    • diagnosis: diagnosis of breast tissues (B: benign, M: malignant)
  • Note: the mean, standard error (se), and "worst" (mean of the three largest values) of each of the 10 base features are computed for each image, resulting in 30 features
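
The 30 features in this dataset are 10 base measurements, each summarized by three statistics. The snippet below builds the 30 column names and computes the three statistics for one measurement. The per-nucleus values are made up, `column_names` and `summarize` are hypothetical helpers, and taking the standard error as the sample standard deviation divided by sqrt(n) is an assumption about how that column was produced.

```python
import math
import statistics

# Base measurements, spelled as in the column names listed above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
STATS = ["mean", "se", "worst"]


def column_names():
    """Return the 30 feature names: all *_mean, then *_se, then *_worst."""
    return [f"{base}_{stat}" for stat in STATS for base in BASE_FEATURES]


def summarize(values):
    """Summarize one measurement taken over the nuclei of one image."""
    return {
        "mean": statistics.mean(values),
        # Assumption: standard error = sample std / sqrt(n).
        "se": statistics.stdev(values) / math.sqrt(len(values)),
        # "worst" = mean of the three largest values.
        "worst": statistics.mean(sorted(values)[-3:]),
    }


names = column_names()
print(len(names), names[0], names[-1])

# Made-up radius measurements for five nuclei in one image.
print(summarize([10.0, 12.0, 14.0, 16.0, 18.0]))
```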

TODO:

  • Implement LDA
  • Implement Logistic Regression
  • Understand the data and comprehend the problem
  • Data analysis with visualization in R, Python
  • Implement the statistical analysis for transformation of data
  • Outliers detection and investigation
  • Implement data transformation
  • Implement model evaluation, metrics, and hyperparameter tuning
  • Test the LDA and Logistic Regression models with post-processing data
  • Test the LDA and Logistic Regression models with pre-processing data
  • Tune the hyperparameters of Logistic Regression
  • Misclassified data analysis
  • Evaluate the models and implement SVM, Gaussian Naive Bayes, XGBoost and CatBoost
  • Implement the ensemble model and compare the results
  • Interpret the results
  • Write the report

References:

License: