The project explores machine learning approaches, specifically supervised learning (classification), to build a predictive model that identifies breast cancer diagnoses.
The Breast Cancer Wisconsin Diagnostic dataset is sourced from the University of California, Irvine machine learning repository. For more information, please refer to the site's documentation.
The dataset has 16 instances with missing attribute values, denoted by ? in the original data.
The data contains 10 attributes related to the diagnosis of breast cancer (the column name used in this project follows each attribute):
- Sample Code Number: `id`
- Clump Thickness: `clump_thickness`
- Uniformity of Cell Size: `size_uniformity`
- Uniformity of Cell Shape: `shape_uniformity`
- Marginal Adhesion: `adhesion`
- Single Epithelial Cell Size: `epithelial_size`
- Bare Nuclei: `bare_nuclei`
- Bland Chromatin: `bland_chromatin`
- Normal Nucleoli: `normal_nucleoli`
- Mitoses: `mitoses`
The dataset has two classes: benign for non-cancerous tumors and malignant for cancerous tumors.
- Benign: `0`
- Malignant: `1`
There are approximately 65.5% benign cases (n_benign = 458) and 34.5% malignant cases (n_malignant = 241).
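Below is a minimal sketch of how the raw UCI file might be loaded and cleaned with pandas; the file path, the cleaning steps, and the 2/4 → 0/1 label re-encoding are illustrative assumptions, not a copy of `src/clf.py`.

```python
# Sketch of loading and cleaning the data (path and exact cleaning steps are
# assumptions, not necessarily what src/clf.py does).
import numpy as np
import pandas as pd

columns = [
    "id", "clump_thickness", "size_uniformity", "shape_uniformity",
    "adhesion", "epithelial_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "class",
]

# The raw UCI file has no header row and uses '?' for missing attribute values.
df = pd.read_csv("data/breast-cancer-wisconsin.data", names=columns)
df = df.replace("?", np.nan).dropna()           # drop the 16 incomplete rows
df["bare_nuclei"] = df["bare_nuclei"].astype(int)

# Re-encode the class labels (2 = benign, 4 = malignant in the raw file)
# as 0/1 to match the convention above.
df["class"] = df["class"].map({2: 0, 4: 1})
```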
Run using the following command:

```
python src/clf.py
```
| Training Set | Testing Set | Random State (Generator) |
|---|---|---|
| 70% (n=478) | 30% (n=205) | 2 |
Note: A fixed random seed is used for development purposes to ensure consistent data partitioning as the code is debugged and tweaked.
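A minimal sketch of the 70/30 split with scikit-learn, continuing the loading sketch above (the variable names are assumptions):

```python
# Sketch of the 70/30 split with a fixed random state (assumes df from above).
from sklearn.model_selection import train_test_split

X = df.drop(columns=["id", "class"])
y = df["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=2
)
```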
Feature selection, otherwise known as dimensionality reduction on sample sets, is intended to improve the estimator's accuracy and boost performance, especially on very high-dimensional datasets.
Essentially, it is implemented as a pre-processing step before fitting a predictive model (supervised learning), with the purpose of removing noisy, insignificant features.
Two feature selection methods were explored: univariate feature selection and Random Forest.
Univariate feature selection selects the best features based on univariate statistical tests.
The `sklearn.feature_selection.SelectKBest` function is utilized for univariate feature selection. It removes all but the `k` highest-scoring features (per their scoring function scores).
Chi2 is the scoring function used for this univariate feature selection. The chi2 statistic between each non-negative feature and class is computed.
The chi2 test measures dependence between variables, filtering out the features most likely to be independent of the class (low chi2 scores and high p-values) and therefore irrelevant for classification.
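A minimal sketch of applying `SelectKBest` with `chi2`, continuing the split sketch above; the choice of `k=5` is arbitrary and for illustration only:

```python
# Sketch of univariate feature selection with chi2 scoring (k=5 is an
# arbitrary illustrative choice).
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=5)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Per-feature chi2 scores and p-values; low scores / high p-values suggest
# independence from the class label.
for name, score, p in zip(X.columns, selector.scores_, selector.pvalues_):
    print(f"{name:20s} chi2={score:8.2f} p={p:.3g}")
```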
The Random Forest classifier model also implicitly conducts feature selection during its fitting/training. As the ensemble of decision trees is fit, each feature is scored by how much it reduces impurity across the trees' splits, and the features are ranked by this importance. This makes it possible to distinguish the most important features from the relatively insignificant ones and boost the classifier's performance.
The feature ranking is an attribute of the fitted model and can be viewed through `.feature_importances_`.
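A minimal sketch of inspecting the impurity-based importances of a fitted Random Forest (hyperparameters are illustrative, not necessarily those used in `src/clf.py`):

```python
# Sketch of ranking features with a Random Forest's impurity-based importances.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=2)
rf.fit(X_train, y_train)

importances = sorted(
    zip(X.columns, rf.feature_importances_), key=lambda t: t[1], reverse=True
)
for name, importance in importances:
    print(f"{name:20s} {importance:.3f}")
```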
The following classification models were evaluated (a brief training sketch follows the list):
- Support Vector Machine
- Logistic Regression
- Gaussian Naive Bayes
- Random Forest
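A minimal sketch of fitting and scoring the four classifiers, continuing the sketches above; the hyperparameters shown are illustrative defaults, not necessarily those used in `src/clf.py`:

```python
# Sketch of fitting and scoring the four classifiers on the held-out test set.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

models = {
    "SVM": SVC(probability=True),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=2),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```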
Please refer to the writeup.md markdown file for discussion of the results.
sensitivity = TP / (TP + FN)
Sensitivity is the probability that a diseased patient has a positive test (malignant). In other words, it's the likelihood that a patient with malignant breast cancer has a positive diagnosis.
Purpose: Determine how good the model is at detecting positive diagnoses.
specificity = TN / (TN + FP)
The probability of a negative test result given the absence of the disease. In other words, the likelihood that a patient with benign breast cancer has a negative diagnosis.
Purpose: Determine how good the model is at avoiding false alarms (false positive diagnoses).
precision = TP / (TP + FP)
The percentage of positive predictions that were actually correct: of the patients classified as having malignant breast cancer, the fraction that were truly positive for the diagnosis.
Purpose: Determine how many of the positively diagnosed patients were relevant and check that the model is not simply over-predicting the positive class.
recall = TP / (TP + FN)
Purpose: Same as sensitivity (recall and sensitivity are equivalent).
F1-measure is essentially a balance between recall and precision, providing their harmonic mean with a value range of [0, 1].
f1 = 2 * (precision * recall) / (precision + recall)
Accuracy is how well the model correctly identifies or excludes a classification/outcome; it is the ratio of all classification instances that were correctly categorized.
accuracy = (TP + TN) / (TP + TN + FP + FN)
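A minimal sketch of computing these metrics from a confusion matrix, continuing the sketches above (assumes one of the fitted models and the held-out test set):

```python
# Sketch of computing the metrics above from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_pred = models["Random Forest"].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

sensitivity = tp / (tp + fn)          # same as recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, precision, f1, accuracy)
```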
ROC curves evaluate the output quality of a classifier. They typically plot the true positive rate and false positive rate on the Y and X axes, respectively.
The goal is a false positive rate of 0 and a true positive rate of 1 for optimal classification; thus, the ideal point on the ROC curve is the "top left corner", corresponding to an area under the curve (AUC) of 1.0.
The steepness of the ROC curve is significant, as it illustrates the extent to which the model maximizes the true positive rate while minimizing the false positive rate.
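A minimal sketch of plotting a ROC curve with its AUC for one of the fitted classifiers, continuing the sketches above; it assumes the chosen model exposes `predict_proba`:

```python
# Sketch of plotting a ROC curve and its AUC for one classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_score = models["Random Forest"].predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)

plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```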
- O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
- William H. Wolberg and O. L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
- O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
- K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).