Model-agnostic-feature-selection

Development of a feature selection scheme that is robust across all the datasets and regardless of the ML model used for classification


CBSD-2022-UNIPD

This study focuses on the replicability of finding relevant predictors for lie detection in various psychometric tests, spanning medicine, behavioral science and data science, each completed twice: once honestly and once dishonestly. More precisely, the goal is to develop a feature-selection framework that yields good and similar results across the different models used to discriminate honest from dishonest test responses. Accuracy, Top-5 stability and Accuracy Standard Deviation are the metrics used to evaluate the results.

*Figure: Overall comparison.*

Approaches used

The approaches developed in this project to select the features are the following:

  1. PCA: a number of principal components equal to 20% of the total number of features is retained via principal component analysis.
  2. Permutation importance: computed on a fitted random forest, with features selected based on a t-test (see the sketch after this list).
  3. Mutual Information: the features selected by the Joint Mutual Information Maximization (JMIM) algorithm with an importance score of at least 0.8 out of 1 are kept.
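
As an illustration of the second approach, a minimal sketch built on scikit-learn is shown below; the helper name `select_by_permutation` and the significance level `alpha` are assumptions for illustration, not the repository's exact code.

```python
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def select_by_permutation(X_train, y_train, alpha=0.05, n_repeats=10, seed=0):
    """Fit a random forest, compute permutation importances and keep the
    features whose importance is significantly greater than zero
    (one-sided one-sample t-test over the permutation repeats)."""
    rf = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    result = permutation_importance(rf, X_train, y_train,
                                    n_repeats=n_repeats, random_state=seed)
    selected = []
    for j in range(X_train.shape[1]):
        # result.importances[j] holds the n_repeats importance draws of feature j
        _, p = stats.ttest_1samp(result.importances[j], 0.0, alternative="greater")
        if p < alpha:
            selected.append(j)
    return selected
```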

Before applying the methods, each dataset is split into training and test sets (70%-30%) and, for every feature, the mean and the standard deviation are computed in order to standardize it: $Z=\frac{X-\mu}{\sigma}$. The three methods are independent of each other, and each of them is described in depth later on.
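
A minimal sketch of this preprocessing step, assuming (as is standard practice, though not spelled out above) that $\mu$ and $\sigma$ are estimated on the training set only:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 70%-30% train/test split (the random seed is an arbitrary choice here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Z = (X - mu) / sigma, one mean/std pair per feature
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```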

Models used

As mentioned before, each of the approaches considered in this project selects a number of features from the corresponding original dataset; these selected features are then used to train different models, whose performance is observed. The models trained in this project are:

  1. Logistic regression model on all the features (Full LR)
  2. Logistic regression model on selected features (LR)
  3. Support vector machine (SVM)
  4. Random forest (RF)
  5. Multi-layer perceptron classifier (MLP)

For each of these models, the corresponding accuracy is also computed, firstly to see how well that model performs with the selected features and secondly to compare the models with each other, in order to figure out whether the selected features give similar performance across all the models. A logistic regression on all the features is trained at the beginning; in this way, the results obtained with the selected features can be compared against a full-feature baseline.
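
A sketch of this benchmarking loop, assuming scikit-learn models with default hyper-parameters (the repository may tune them differently) and a `selected` array of column indices produced by one of the procedures above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Baseline: logistic regression on all the features (Full LR)
full_lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Full LR:", accuracy_score(y_test, full_lr.predict(X_test)))

# The four benchmark models, trained only on the selected features
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
}
accuracies = {}
for name, model in models.items():
    model.fit(X_train[:, selected], y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test[:, selected]))
    print(name, accuracies[name])
```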

Metrics

  • Accuracy: ratio of correct predictions over the number of instances. This has been chosen because all the datasets show a fairly balanced number of examples per class (all are binary classification tasks). The accuracy is computed on the full model (Full LR) as well as on the other four models used for benchmarking, which are trained only on the subset of features selected by each of the procedures in scope.
  • Accuracy Standard Deviation: standard deviation of the accuracies of the four models (i.e. LR, SVM, RF, MLP) fitted on the subset of selected features. It is a measure of the consistency of the classification performance across different models, so the lower the better.
  • Top-5 stability: a more specific metric for assessing consistency across models (i.e. LR, SVM, RF, MLP). It takes into account the five most important features used by each of the models; the formula developed is (a sketch implementing both consistency metrics follows this list):
$$\text{Top-5 Stability}=1-\frac{1}{(\text{num.models}-1)\cdot\min(5,|\Omega|)}\sum_{i=1}^{\min(5,|\Omega|)}\left(|\beta_{i}|-1\right)$$

where $\Omega$ is the set of features selected by a procedure, i.e. $\Omega=\{\beta_1,\dots,\beta_n\}$; $\beta_i$ is a vector containing, for each model, the feature ranked $i$-th in importance (notice that in our case $\text{num.models}=4$); and $|\beta_i|$ is the number of unique values in $\beta_i$.
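
The two consistency metrics translate directly into a few lines of Python. In this sketch, `accuracies` (one accuracy per model, as in the benchmarking loop above) and `rankings` (one importance-sorted feature list per model) are hypothetical inputs:

```python
import numpy as np

def accuracy_std(accuracies):
    """Standard deviation of the accuracies of the four benchmark models."""
    return np.std(list(accuracies.values()))

def top5_stability(rankings, num_models=4):
    """rankings[m] lists the selected features of model m, sorted by importance;
    the set built for rank i plays the role of beta_i in the formula above."""
    k = min(5, len(rankings[0]))  # min(5, |Omega|)
    penalty = sum(len({rankings[m][i] for m in range(num_models)}) - 1
                  for i in range(k))
    return 1 - penalty / ((num_models - 1) * k)
```

With four models the metric equals 1 when all of them agree on each of the top-five ranks, and 0 when they disagree completely at every rank.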

Datasets

| Name | Topic | Faking good/faking bad | Number of samples | Number of features |
|------|-------|------------------------|-------------------|--------------------|
| DT_df_CC | Short Dark Triad 3 for child custody | Faking good | 482 | 27 |
| DT_df_JI | Short Dark Triad 3 for a job interview | Faking good | 864 | 27 |
| PRMQ_df | Identify memory difficulties | Faking bad | 1404 | 16 |
| PCL5_df | Identify victims of PTSD | Faking bad | 402 | 20 |
| NAQ_R_df | Identify possible victims of mobbing | Faking bad | 712 | 22 |
| PHQ9_GAD7_df | Identify possible victims of anxious-depressive syndrome | Faking bad | 1118 | 16 |
| PID5_df | Identify mental disorders | Faking bad | 824 | 220 |
| sPID5_df | Identify mental disorders | Faking bad | 1038 | 25 |
| PRFQ_df | Caregivers' ability to mentalize with their children | Faking good | 678 | 18 |
| IESR_df | Identify possible victims of PTSD | Faking bad | 358 | 22 |
| R_NEO_PI_df | Personality questionnaire (Big 5) | Faking good | 77687 | 30 |
| RAW_DDDT_df | Identify Dark Triad personality | Faking bad | 986 | 12 |
| IADQ_df | Identify adjustment disorder (stress response syndrome) | Faking bad | 450 | 9 |
| BF_df_CTU | Job interview for a salesperson position | Faking good | 442 | 10 |
| BF_df_OU | Job interview in a humanitarian organization | Faking good | 460 | 10 |
| BF_df_V | Obtain child custody | Faking good | 486 | 10 |