A brief description of what your project does and the value it provides.
The dataset used is the Diagnostic Wisconsin Breast Cancer Dataset. It contains 30 features describing the geometry and texture of the cell nuclei, e.g.:
- Radius
- Perimeter
- Area
- Texture
- Smoothness
- Compactness
- Concavity
- Concave Points
- Symmetry
- Fractal Dimension
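For reference, this dataset also ships with scikit-learn, so a minimal sketch of loading it (assuming scikit-learn is installed) looks like this:

```python
from sklearn.datasets import load_breast_cancer

# Load the Diagnostic Wisconsin Breast Cancer dataset bundled with scikit-learn
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

print(X.shape)               # (569, 30): 569 samples, 30 features
print(list(X.columns[:3]))   # e.g. mean radius, mean texture, mean perimeter
```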
To clean the dataset, we checked for outliers, missing data, and other abnormalities; fortunately, the only issue present was outliers. To detect them, we drew box plots for all of the features, which revealed outliers in almost all of them. To fix this, we used min-max clipping, computing the bounds as min = Q1 - 1.5(Q3 - Q1) and max = Q3 + 1.5(Q3 - Q1), then looped over the dataset to replace each outlier with either the min or the max, using the lines below, where X_clean is the dataset without outliers:
# Clip outliers in the DataFrame to the IQR bounds
Q1, Q3 = X_clean[col].quantile(0.25), X_clean[col].quantile(0.75)
lower_bound = Q1 - 1.5 * (Q3 - Q1)
upper_bound = Q3 + 1.5 * (Q3 - Q1)
X_clean[col] = X_clean[col].clip(lower=lower_bound, upper=upper_bound)
Here are some screenshots of the box plots before and after the clipping:
Before clipping:
The data was then standardized to a mean of 0 and a standard deviation of 1, to prevent features with larger values from skewing the results at the expense of features with smaller values.
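This standardization can be sketched with scikit-learn's StandardScaler; the toy matrix below is a hypothetical stand-in for the cleaned feature DataFrame:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned feature matrix (hypothetical values)
X_clean = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Fit on the data and transform it so each column has mean 0 and std 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clean)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```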
In order to select the most significant features, i.e. those that will yield the best performance, we calculated the Spearman correlation coefficient between each feature and the labels and removed any feature that scored below 0.5. Plotting the correlation matrix gave the following:
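The threshold filter can be sketched with pandas, whose `corrwith` computes a per-feature correlation against the label vector; the feature values and labels below are hypothetical:

```python
import pandas as pd

# Toy data: two informative features and one noisy one (hypothetical values)
X = pd.DataFrame({
    "radius_mean": [1, 2, 3, 4, 5, 6],
    "texture_mean": [2, 1, 4, 3, 6, 5],
    "noise": [5, 1, 4, 2, 6, 3],
})
y = pd.Series([0, 0, 0, 1, 1, 1])

# Spearman correlation of each feature with the label; keep |rho| >= 0.5
rho = X.corrwith(y, method="spearman").abs()
selected = rho[rho >= 0.5].index.tolist()
print(selected)  # the "noise" column is dropped
```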
The data was split with a train-to-test ratio of 7:3. The same dataset had previously been used with an 8:2 split, and we wanted to experiment further. We used stratified random sampling to ensure the training data represents the original dataset as closely as possible.
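A stratified 70/30 split can be sketched with scikit-learn's train_test_split; the toy arrays below are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with imbalanced classes (hypothetical)
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)

# 70/30 split; stratify=y keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```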
For each model (Random Forest, Logistic Regression, K-Nearest Neighbors, XGBoost, and SVM), explain why you chose it, how it works, and any specific parameters you tuned.
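A sketch of how such a model comparison might be set up with scikit-learn, on synthetic stand-in data; the parameter values shown are illustrative defaults, not the tuned ones (XGBoost's XGBClassifier would slot into the same loop):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the processed dataset (hypothetical)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# One entry per model family, with illustrative (untuned) parameters
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", C=1.0),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: {model.score(X, y):.3f}")
```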
Explain why you used K-Fold cross-validation with k=5, how it helps in model validation, and what the results were.
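A minimal sketch of 5-fold cross-validation with scikit-learn, using synthetic data as a stand-in for the processed dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the processed dataset (hypothetical)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# k=5: train on 4 folds, validate on the held-out fold, repeated 5 times
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```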
Discuss how you calculated the confusion matrix and other metrics (accuracy, precision, F1 score, and recall). Explain what each metric indicates about your model's performance.
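These metrics can be sketched with scikit-learn's metrics module; the label vectors below are hypothetical:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))    # 0.75
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall:", recall_score(y_true, y_pred))        # 0.75
print("F1:", f1_score(y_true, y_pred))                # 0.75
```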
Discuss the results of your models, any insights you gained, and what these results mean.
Information about how other developers can contribute to your project.
Information about the license.