This is a coursework project I completed in R for ST3189. It involves three datasets: the European Working Conditions Survey 2016, the Bank Marketing Dataset, and the Student Performance Dataset. This repo contains the data description files, the write-up, and the R code.
The prompt is as follows:

In the first part you will be given data collected from various individuals on several variables. The goal will be to use unsupervised learning techniques such as Principal Component Analysis or Cluster Analysis to summarise the information in the data by appropriate tables and/or plots.

In the second part you will be presented with a regression problem. The aim would be to compare various models and techniques for their estimation to allow meaningful interpretation and competitive predictive performance. The latter should be assessed by appropriate experiments based on training and test datasets. In addition to linear regression, tree-based methods, non-linear models or other suitable techniques can be used if you think they can provide improvement.

Finally, in the third part you will be given a classification problem. The analysis will contain similar steps to the second part, but you should be able to interpret the output from different models and compare their predictive performance, taking into account that the response variable will be binary. In addition to appropriate regression or discriminant analysis, tree-based methods, non-linear models or other suitable techniques can be used if you think they will perform better.
Tools involved: Principal Component Analysis, Multiple Linear Regression, Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, K-Nearest Neighbours classifier, and Classification Tree.
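To give a feel for the workflow the prompt describes, here is a minimal base-R sketch. It is not the code from this repo: the built-in `USArrests` and `mtcars` datasets are stand-ins for the coursework data, and the predictors, the 70/30 split, and the 0.5 threshold are assumptions chosen only for illustration.

```r
# Illustrative sketch only: built-in datasets stand in for the coursework data.
set.seed(1)

# Part 1 (unsupervised): PCA on standardised variables.
pca <- prcomp(USArrests, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per component
print(round(var_explained, 3))

# Shared train/test split (70/30, an assumed ratio).
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Part 2 (regression): fit on the training set, score on the test set.
fit_lm <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit_lm, newdata = test)
test_rmse <- sqrt(mean((test$mpg - pred)^2))
print(test_rmse)

# Part 3 (binary classification): logistic regression on a 0/1 response.
fit_glm <- glm(am ~ wt, data = train, family = binomial)
prob <- predict(fit_glm, newdata = test, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)  # assumed 0.5 cut-off
test_accuracy <- mean(pred_class == test$am)
print(test_accuracy)
```

The same split is reused for both supervised parts so that competing models would be scored on identical held-out rows. On a sample this small, `glm` may warn about fitted probabilities close to 0 or 1.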