Principal Component Analysis (PCA)👨🏼‍🏫

Principal Component Analysis is a dimensionality-reduction method that is often used to reduce the dimensionality of large dataset. The idea of PCA is simple — reduce the number of variables of a data set, while preserving as much information as possible. Following the application of this method, we have several benefits

we can rank the observations based on several variables
overcome multi-collinearity
data visualization (biplot)

This method is based on the following steps:

Standardize the range of continuous initial variables
Compute the covariance matrix to identify correlations
Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
Create a feature vector to decide which principal components to keep
Recast the data along the principal components axes

We will apply Principal Component Analysis for breast cancer Wisconsin (original) dataset. The dataset contains 699 real observations considering 9 independent variables that allow us to classify the dependent variable as malignant or benign. A brief description of the medical terminology can be consulted in this notebook.

📚References.

Steven M. Holland, Univ. of Georgia: Principal Components Analysis
skymind.ai: Eigenvectors, Eigenvalues, PCA, Covariance and Entropy
Lindsay I. Smith: A tutorial on Principal Component Analysis

MihaiTudor26/Principal-Component-Analysis

Principal Component Analysis (PCA)👨🏼‍🏫