
Breast-Cancer-Detection

Detecting breast cancer using machine learning: this project trains and compares several classifiers (Random Forest, Logistic Regression, K-Nearest Neighbors, XGBoost, and SVM) on the Wisconsin Diagnostic Breast Cancer dataset to predict whether a tumour is benign or malignant.

Table of Contents

  • Dataset
  • Methodology
      • Data Preprocessing
      • Feature Selection
      • Data Splitting
      • Model Training
      • Cross Validation
      • Performance Evaluation
  • Results
  • Contributing
  • License

Dataset

The dataset used is the Wisconsin Diagnostic Breast Cancer dataset. It contains 30 features describing the geometry and texture of the cell nuclei, namely the mean, standard error, and worst value of ten base measurements:

  • Radius
  • Perimeter
  • Area
  • Texture
  • Smoothness
  • Compactness
  • Concavity
  • Concave Points
  • Symmetry
  • Fractal Dimension
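
For reference, here is a minimal loading sketch. It assumes scikit-learn's bundled copy of the dataset; the notebook may instead read a CSV export, and the variable names X and y are illustrative:

    from sklearn.datasets import load_breast_cancer

    # Load the Wisconsin Diagnostic Breast Cancer dataset as pandas objects
    data = load_breast_cancer(as_frame=True)
    X, y = data.data, data.target   # 569 samples, 30 features; target: 0 = malignant, 1 = benign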

Methodology

Data Preprocessing

To clean the dataset we checked for outliers, missing data, and other abnormalities; fortunately the only issue present was outliers. Box plots were drawn for every feature, and outliers were found in almost all of them. To solve this we used min-max clipping: for each feature we computed min = Q1 - 1.5(Q3 - Q1) and max = Q3 + 1.5(Q3 - Q1), then looped over the dataset and replaced every outlier with either the min or the max using the lines below, where X_clean is a copy of the feature DataFrame that holds the clipped values:

    # Clip outliers in every feature to the bounds defined above
    for col in X_clean.columns:
        q1, q3 = X_clean[col].quantile([0.25, 0.75])
        lower_bound, upper_bound = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        X_clean[col].replace(list(X_clean[X_clean[col] > upper_bound][col]), upper_bound, inplace=True)
        X_clean[col].replace(list(X_clean[X_clean[col] < lower_bound][col]), lower_bound, inplace=True)

Here are box plots of the features before and after the clipping:

Before clipping:

(box plot screenshots)

After clipping:

(box plot screenshots)

The data was then normalized to mean 0 and standard deviation 1, to prevent features with larger values from skewing the results at the expense of the features with smaller values.
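
A sketch of this standardization, assuming scikit-learn's StandardScaler (the X_clean and X_scaled names carry on from the clipping step above):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Rescale every feature to mean 0 and standard deviation 1
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X_clean),
                            columns=X_clean.columns, index=X_clean.index)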

Feature Selection

In order to select the most significant features, i.e. those that will yield the best performance, we calculated the Spearman correlation coefficient between each feature and the labels and removed every feature that scored below 0.5. Plotting the correlation matrix gave the following:

(Spearman correlation matrix plot)
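
A minimal sketch of this selection step. It assumes the 0.5 threshold is applied to the absolute correlation, and reuses the X_scaled and y variables from the earlier sketches:

    # Spearman correlation of every feature with the label; keep |rho| >= 0.5
    corr = X_scaled.corrwith(y, method="spearman").abs()
    X_selected = X_scaled[corr[corr >= 0.5].index]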

Data Splitting

The data was split into training and test sets with a 7:3 ratio (the same dataset had previously been split 8:2, and we wanted to experiment further). Stratified random sampling was used to make sure the training data represents the class distribution of the original dataset as closely as possible.
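
A sketch of the split, assuming scikit-learn's train_test_split and the X_selected and y variables from above (the random seed is illustrative):

    from sklearn.model_selection import train_test_split

    # 70/30 split, stratified on the label to preserve the benign/malignant ratio
    X_train, X_test, y_train, y_test = train_test_split(
        X_selected, y, test_size=0.3, stratify=y, random_state=42)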

Model Training

Five classifiers were trained and compared on the same training data: Random Forest, Logistic Regression, K-Nearest Neighbors, XGBoost, and SVM.
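
A minimal training sketch; the hyperparameters shown are library defaults and only illustrative, as the notebook's actual settings may differ:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    # One instance per model, all fit on the same stratified training split
    models = {
        "Random Forest": RandomForestClassifier(random_state=42),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "K-Nearest Neighbors": KNeighborsClassifier(),
        "XGBoost": XGBClassifier(eval_metric="logloss"),
        "SVM": SVC(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)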

Cross Validation

K-Fold cross-validation with k = 5 was used: the training data is split into five folds, each model is trained on four folds and validated on the held-out fold, and the five scores are averaged. This gives a more stable estimate of each model's performance than a single validation split and helps detect overfitting before the test set is touched.
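
A sketch of the validation loop, assuming scikit-learn's cross_val_score and the models dictionary defined above:

    from sklearn.model_selection import cross_val_score

    # Mean and spread of the 5-fold cross-validation accuracy for every model
    for name, model in models.items():
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")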

Performance Evaluation

Each model was evaluated on the held-out test set with a confusion matrix and the metrics derived from it: accuracy, precision, recall, and F1 score. Accuracy is the overall fraction of correct predictions; precision is the fraction of predicted positives that are truly positive; recall is the fraction of actual positives that are detected; and the F1 score is the harmonic mean of precision and recall. Recall is particularly important when the malignant diagnosis is the positive class, since a missed malignancy is the costliest error.
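
A sketch of the evaluation step; the models dictionary and the test split carry over from the earlier sketches:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    # Test-set confusion matrix and metrics for every trained model
    # (precision, recall, and F1 treat label 1 as the positive class by default)
    for name, model in models.items():
        y_pred = model.predict(X_test)
        print(name)
        print(confusion_matrix(y_test, y_pred))
        print(f"accuracy={accuracy_score(y_test, y_pred):.3f}",
              f"precision={precision_score(y_test, y_pred):.3f}",
              f"recall={recall_score(y_test, y_pred):.3f}",
              f"F1={f1_score(y_test, y_pred):.3f}")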

Results

The cross-validation and test-set scores for each model, along with the comparison between the five classifiers, are reported in the notebook.

Contributing

Information about how other developers can contribute to your project.

License

Information about the license.