/Breast_Cancer_Prediction

Breast Cancer Prediction and Data Visualization Using K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) Classifier Algorithms.

Primary LanguageJupyter Notebook

Breast_Cancer_Prediction

Breast Cancer Prediction and Data Visualization

  • Using K-Nearest Neighbor (KNN) Classifier Algorithm K-Nearest Neighbors: algorithm for supervised learning. Data is 'trained' with data points corresponding to their classification. Once a point is to be predicted, it takes into account the 'K' nearest points to it to determine it's classification.

  • Using Support Vector Machine (SVM) Classifier Algorithm SVM: maps data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong.

Dataset: Breast Cancer Wisconsin (Diagnostic) Data Set (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/)

@author: Mariana R. Barros

=============================================================================

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3. Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter) b) texture (standard deviation of gray-scale values) c) perimeter d) area e) smoothness (local variation in radius lengths) f) compactness (perimeter^2 / area - 1.0) g) concavity (severity of concave portions of the contour) h) concave points (number of concave portions of the contour) i) symmetry j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant