Principal Component Analysis (PCA) is a dimensionality-reduction method often applied to large datasets. The idea is simple: reduce the number of variables in a dataset while preserving as much information as possible. Applying this method brings several benefits:
- ranking the observations based on several variables
- overcoming multicollinearity
- visualizing the data (biplot)
This method is based on the following steps:
- Standardize the range of continuous initial variables
- Compute the covariance matrix to identify correlations
- Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
- Create a feature vector to decide which principal components to keep
- Recast the data along the principal component axes
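
The five steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the analysis itself: the function name `pca_steps`, the argument `n_components`, and the synthetic demo array are all hypothetical and chosen only for this example.

```python
import numpy as np


def pca_steps(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Step 1: standardize each variable to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: covariance matrix of the standardized variables.
    cov = np.cov(X_std, rowvar=False)

    # Step 3: eigen-decomposition. eigh is meant for symmetric matrices;
    # sort the eigenpairs so the largest eigenvalue comes first.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Step 4: the feature vector holds the eigenvectors we decide to keep.
    feature_vector = eigenvectors[:, :n_components]

    # Step 5: recast the data along the principal component axes.
    scores = X_std @ feature_vector
    explained_ratio = eigenvalues / eigenvalues.sum()
    return scores, explained_ratio


# Synthetic data for illustration only (not the Wisconsin dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=100)  # correlated pair
scores, explained_ratio = pca_steps(X_demo, n_components=2)
```

Here `scores` holds the observations expressed in the retained components, and `explained_ratio` gives the share of total variance carried by each component, which is one common basis for deciding how many components to keep.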
We will apply Principal Component Analysis to the Breast Cancer Wisconsin (Original) dataset. The dataset contains 699 real observations with 9 independent variables that allow us to classify the dependent variable as malignant or benign. A brief description of the medical terminology can be found in this notebook.
📚References.
- Steven M. Holland, Univ. of Georgia: Principal Components Analysis
- skymind.ai: Eigenvectors, Eigenvalues, PCA, Covariance and Entropy
- Lindsay I. Smith: A tutorial on Principal Component Analysis