Breast cancer is the most common type of cancer, accounting for a staggering 12% of new cancer cases each year, according to the World Health Organization. It is estimated that in 2022 there were 287,850 new cases of invasive breast cancer in the United States. Producing a diagnosis, from the moment a biopsy is taken to the moment the results reach the patient, requires a lot of resources: a large team of physicians, trained laboratory staff, certified pathologists, and fast transcriptionists.
The Breast Cancer Wisconsin (Diagnostic) Data Set was obtained from the UCI Machine Learning Repository. The features describe characteristics of the cell nuclei, computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The data consists of the characteristics of 569 breast mass images with thirty-three variables.
Source: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
When I worked for a pathology lab, I observed how complex the process of generating biopsy results is. The purpose of this analysis is to use machine learning to classify a breast cancer diagnosis as benign or malignant based on a subset of the characteristics of an FNA of a breast mass, using data science to make a complicated and lengthy process easier. The goal of this project is to build a model that can accurately predict the diagnosis of breast cancer tissue as either malignant or benign.
How do Decision Tree, Random Forest, Logistic Regression, Support Vector Machines (SVM), Naïve Bayes (NB), Stochastic Gradient Descent (SGD), and K Nearest Neighbors (KNN) compare with each other in classifying whether a mass is benign or malignant based on the FNA’s characteristics?
- Diagnosis (M = malignant, B = benign): 357 observations in the benign class and 212 in the smaller malignant class. The distribution of the target variable is not ideal (a perfect 50/50), but it is not terrible either at 37% malignant and 63% benign. ROC, F1, precision, and recall scores were used to evaluate the algorithms and compensate for this slight imbalance.
- ID number
- Columns 3-32: ten real-valued features computed for each cell nucleus:
  - a) radius (mean of distances from center to points on the perimeter)
  - b) texture (standard deviation of gray-scale values)
  - c) perimeter
  - d) area
  - e) smoothness (local variation in radius lengths)
  - f) compactness (perimeter^2 / area - 1.0)
  - g) concavity (severity of concave portions of the contour)
  - h) concave points (number of concave portions of the contour)
  - i) symmetry
  - j) fractal dimension ("coastline approximation" - 1)
For each breast mass FNA image, the mean, standard error, and "worst" or largest (mean of the three largest values) of each of the ten features a) through j) were computed, resulting in thirty features: for example, the columns radius_mean, radius_se, and radius_worst.
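As a minimal sketch, the data can be loaded and the class balance checked as follows; the file name data.csv (a common CSV export of this dataset) and the pandas workflow are assumptions, not the exact code used in the analysis:

```python
# Sketch: load the dataset and check the class balance.
# Assumes a local CSV export of the UCI dataset named "data.csv"
# with a "diagnosis" column (B = benign, M = malignant).
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)                                      # (569, 33)
print(df["diagnosis"].value_counts())                # 357 B, 212 M
print(df["diagnosis"].value_counts(normalize=True))  # ~63% B / ~37% M
```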
The dataset was analyzed and was not considered very dirty. The id column and the all-null Unnamed column were dropped. Multicollinearity is present in this dataset: the mean, standard error, and worst measures of a feature are correlated (radius_mean, radius_se, and radius_worst, for example), and some variables, such as radius, perimeter, and area, are highly correlated with each other. The other problem variables are compactness, concavity, concave points, and fractal_dimension. Data scientists are like bartenders: they need to get the ingredients ready for different drinks. The dataset was scaled appropriately based on the model and the type of data (the diagnosis was encoded as M = 1 or B = 0 using LabelEncoder(), and the continuous features were scaled using RobustScaler()). Multicollinearity was taken into consideration for the models that are sensitive to it. The pair plots were a great visual summary, showing that a number of the variables would separate the classes well and others would not. The boxplots showed skewness caused by outliers. Outliers were not dropped because there is not enough information about them, and with so few observations every one is important; these outliers might be representative cases.
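The cleaning, encoding, and scaling steps might look like the following sketch, continuing from the loading snippet; the column names id and Unnamed: 32 come from the CSV export and are assumptions about the local copy:

```python
# Sketch: drop the identifier columns, encode the target, scale the features.
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Drop the id column and the all-null trailing column from the CSV export
df = df.drop(columns=["id", "Unnamed: 32"])

# LabelEncoder sorts labels alphabetically, so B -> 0 and M -> 1
df["diagnosis"] = LabelEncoder().fit_transform(df["diagnosis"])

# RobustScaler centers on the median and scales by the IQR, so the
# outliers that were deliberately kept have less influence on the scaling
features = df.drop(columns=["diagnosis"])
X = RobustScaler().fit_transform(features)
y = df["diagnosis"]
```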
Random Forest feature importance revealed that radius_worst is a very important feature.
The Lasso coefficients indicated concave points_mean as the most important variable.
The data was split into 70% training data and 30% testing data. With many variables and the algorithms performing below expectations, feature importance from Decision Trees and Lasso was used to select the most important variables. Different subsets of the variables were tried, taking the feature selection results and multicollinearity into account.
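A sketch of the split and the two feature-importance views, continuing from the preprocessing snippet; it uses the Random Forest importances from the figure above alongside the Lasso coefficients, and the n_estimators, alpha, and random_state values are illustrative placeholders, not the tuned values:

```python
# Sketch: 70/30 stratified split, then two views of feature importance.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

# 70/30 split; stratify preserves the 63/37 benign/malignant ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Tree-based importances (this view flagged radius_worst)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(pd.Series(rf.feature_importances_, index=features.columns)
        .sort_values(ascending=False).head(10))

# Lasso coefficients (this view flagged concave points_mean);
# features shrunk exactly to zero are candidates for dropping
lasso = Lasso(alpha=0.01).fit(X_train, y_train)
coefs = pd.Series(lasso.coef_, index=features.columns)
print(coefs[coefs != 0].sort_values(key=abs, ascending=False))
```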
Logistic Regression (Ridge and Lasso), NB (Gaussian), SVM, SGD (and Stochastic Gradient Boosting), Decision Tree, Random Forest, and KNN models were applied, with the training data used to fit each model. In an attempt to improve the results, the seven classification algorithms were run on a variety of subsets of the variables. A model with all the features was also used to try to reduce the number of misclassified malignant cases; in health care, and especially in cancer diagnosis, false negatives are costly. Each model was hyperparameter-tuned using ROC AUC for scoring, and Repeated Stratified K-Fold was used for cross validation to compensate for the slight imbalance in the outcome variable. Repeated Stratified K-Fold is a cross validator that maintains the same class ratio in each of the K folds as in the original dataset. The best model was Logistic Regression, with a ROC AUC score of 98% and only one malignant case classified incorrectly.
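For the Logistic Regression case, the tuning setup might look like this sketch; the roc_auc scoring and RepeatedStratifiedKFold cross validator follow the description above, while the parameter grid and fold counts are illustrative assumptions:

```python
# Sketch: hyperparameter tuning with ROC AUC scoring and
# Repeated Stratified K-Fold cross validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Each repetition re-shuffles and re-folds the data while keeping the
# benign/malignant ratio identical in every fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=10000),
    param_grid={
        "penalty": ["l1", "l2"],       # Lasso and Ridge regularization
        "C": [0.01, 0.1, 1, 10, 100],  # inverse regularization strength
        "solver": ["liblinear"],       # supports both penalties
    },
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```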
The ROC curve shows the model is doing much better than flipping a coin; at a 98% score, the model performs very well.
0 is benign and 1 is malignant (the smaller class). For the best Logistic Regression model, the confusion matrix shows only one malignant case inaccurately classified as benign.
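A sketch of the held-out evaluation behind these figures, assuming the tuned model from the grid search sketch above:

```python
# Sketch: confusion matrix and ROC AUC on the 30% test set.
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]  # P(malignant)

# Rows are the true classes (0 = benign, 1 = malignant); the bottom-left
# cell counts malignant cases predicted benign, the costly false negatives
print(confusion_matrix(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))
```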
A variety of models and hyperparameter tuning were used to improve the results. The best overall model was Logistic Regression: its confusion matrix shows only one malignant case inaccurately categorized as benign and two benign cases inaccurately classified, and its 98% ROC score is much better than flipping a coin. The second best was the Random Forest: its confusion matrix shows two malignant and two benign cases inaccurately classified, and its 96% ROC score is great but not as good as the Logistic Regression model's. A residual outlier sensitivity check was also conducted: comparing the predicted probabilities against the actual outcomes surfaced one observation (73) that the model was 99% sure was benign but was actually malignant. Dropping this "outlier" did not significantly improve the algorithms, and since it could be a representative case, the results from that subset were not taken very seriously. In the future, a larger number of observations is needed to improve the predictions and make this a trustworthy model.
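The residual sensitivity check described above could be sketched as follows, continuing from the evaluation snippet; the 0.95 cutoff for "high-confidence mistakes" is an illustrative assumption:

```python
# Sketch: flag test observations the model was very confident about but
# got wrong, like the malignant case scored ~99% benign.
import numpy as np

# Residual between the actual outcome (0/1) and the predicted probability
residuals = np.abs(y_test.to_numpy() - y_prob)

# A residual near 1 means a high-confidence misclassification
for i in np.where(residuals > 0.95)[0]:
    print(f"test row {i}: actual={y_test.iloc[i]}, "
          f"predicted P(malignant)={y_prob[i]:.3f}")
```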