Project for the Data Mining 1 exam. The dataset provided is a modified version of the data available at https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset, in which some values were removed to create missing values. The analysis therefore concerns IBM HR Analytics Employee Attrition & Performance.
- Data semantics (3 points)
- Distribution of the variables and statistics (7 points)
- Assessing data quality (missing values, outliers) (7 points)
- Variable transformations (6 points)
- Pairwise correlations and possible elimination of redundant variables (7 points)
- Choice of attributes and distance function (1 point)
- Identification of the best value of k (5 points)
- Characterization of the obtained clusters using both an analysis of the k centroids and a comparison of the distribution of variables within the clusters with that in the whole dataset (7 points)
- Choice of attributes and distance function (2 points)
- Study of the clustering parameters (2 points)
- Characterization and interpretation of the obtained clusters (5 points)
- Choice of attributes and distance function (2 points)
- Show and discuss different dendrograms using different algorithms (3 points)
- Final evaluation of the best clustering approach and comparison of the clusterings obtained (3 points)
- Frequent pattern extraction with different values of support and different types (i.e. frequent, closed, maximal) (6 points)
- Discussion of the most interesting frequent patterns and analysis of how the number of patterns changes w.r.t. the min_sup parameter (7 points)
- Association rule extraction with different values of confidence (6 points)
- Discussion of the most interesting rules and analysis of how the number of rules changes w.r.t. the min_conf parameter, with histograms of the rules' confidence and lift (7 points)
- Use the most meaningful rules to replace missing values and evaluate the accuracy (2 points)
- Use the most meaningful rules to predict the target variable and evaluate the accuracy (2 points)
- Learning of different decision trees/classification algorithms with different parameters and gain formulas, with the aim of maximizing performance (12 points)
- Decision tree interpretation, validation with test and training sets (6 points)
- Training of different KNN classifiers with different parameters, with the aim of maximizing performance (6 points)
- Discussion of the best prediction model (6 points)
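
As a reminder of what the min_sup, min_conf and lift quantities in the pattern mining tasks measure, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: the toy transactions and the `attribute=value` items are hypothetical and are not taken from the project dataset, the helper names (`support`, `frequent_itemsets`, `rule_metrics`) are invented for this example, and the brute-force enumeration is not the algorithm used in the project.

```python
from itertools import combinations

# Toy transactions; the attribute=value items are hypothetical,
# not values taken from the project dataset.
transactions = [
    {"OverTime=Yes", "Attrition=Yes"},
    {"OverTime=Yes", "Attrition=Yes"},
    {"OverTime=Yes", "Attrition=No"},
    {"OverTime=No", "Attrition=No"},
    {"OverTime=No", "Attrition=No"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of all itemsets with support >= min_sup."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_sup:
                result[frozenset(candidate)] = s
                found = True
        # If no itemset of this size is frequent, no larger one can be.
        if not found:
            break
    return result


def rule_metrics(antecedent, consequent, transactions):
    """Confidence and lift of the rule antecedent -> consequent."""
    sup_both = support(set(antecedent) | set(consequent), transactions)
    confidence = sup_both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return confidence, lift


# The number of frequent itemsets shrinks as min_sup grows.
counts = {m: len(frequent_itemsets(transactions, m)) for m in (0.4, 0.6)}
conf, lift = rule_metrics({"OverTime=Yes"}, {"Attrition=Yes"}, transactions)
```

Raising `min_sup` can only remove itemsets, never add them, which is why the pattern count is non-increasing in that parameter; the same holds for the number of rules as `min_conf` increases.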