Introduction

Machine learning on a data set with a very large number of features is time-consuming. Dimension Reduction (DR) is a data preprocessing technique that preserves the most salient information and removes data that is irrelevant to analysis and prediction. By focusing on a smaller but more informative set of features, DR can significantly improve learning efficiency. The data set in this assignment contains 612 samples (rows) of emotion data; each sample has 132 features (columns) and a target label (the emotion class). We implemented and analyzed two DR approaches: Correlation Feature Selection (CFS) and Principal Component Analysis (PCA). This report explains CFS and PCA and how we apply them to preprocess the data before decision tree learning. It also compares the performance of the decision trees trained under the different preprocessing techniques. 10-fold cross-validation is used in this assignment, and we describe in detail how it is combined with each of the two DR approaches.
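
To make the 10-fold cross-validation setup concrete, the following is a minimal sketch of how 612 sample indices could be partitioned into ten folds. It uses only the Python standard library. The function names `k_fold_indices` and `train_test_splits` and the `seed` parameter are hypothetical and chosen for illustration; this is not necessarily the code used for the experiments in this report.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    base, extra = divmod(n_samples, k)    # the first `extra` folds get one more sample
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_test_splits(n_samples, k=10, seed=0):
    """Yield one (train_idx, test_idx) pair per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test_idx in enumerate(folds):
        # Training set = every fold except the held-out one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Assumption (common practice, not taken from this report):
        # a DR step such as CFS or PCA would be fitted on train_idx only
        # and then applied unchanged to test_idx.
        yield train_idx, test_idx


splits = list(train_test_splits(612, k=10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Because 612 is not divisible by 10, two of the folds hold 62 samples and the other eight hold 61, so every sample is used for testing exactly once.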