TSM

This package, TSM (a.k.a. The Smith Method), selects a desired number of features (4 by default) by purposefully dropping highly correlated ones, i.e., by picking highly representative features that best explain a binary outcome. In plain English, it works as follows. The first representative feature is the one with the highest AUC (Area Under the ROC Curve) among all features. The next representative feature is the one with the highest AUC among the remaining features, after dropping those that are highly correlated with the first representative feature. The third, fourth, and subsequent representative features are picked in the same way as the second.
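As a rough illustration of the first step, the sketch below ranks features by their individual AUC using the rank-based (Mann-Whitney) formula in base R. The helper `auc_one()` and the toy data are hypothetical and are not part of TSM; how TSM itself computes the AUC, and how it treats the direction of the association, is an assumption here.

```r
# Rank-based (Mann-Whitney) AUC of one numeric feature against a 0/1 outcome.
# auc_one() is a hypothetical helper, not a function exported by TSM.
auc_one <- function(x, y) {
  r  <- rank(x)                 # average ranks, so ties are handled
  n1 <- sum(y == 1)             # number of positives
  n0 <- sum(y == 0)             # number of negatives
  auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(auc, 1 - auc)             # assumption: direction of association is ignored
}

# Toy data (simulated, not the demo input shipped with TSM)
set.seed(1)
y   <- rep(c(0, 1), each = 50)
toy <- data.frame(
  F1 = rnorm(100, mean = 2 * y),    # strongly related to y
  F2 = rnorm(100, mean = 0.5 * y),  # weakly related to y
  F3 = rnorm(100)                   # unrelated noise
)

# AUC of every feature, best first; the top one is the first representative
aucs <- sort(sapply(toy, auc_one, y = y), decreasing = TRUE)
first_pick <- names(aucs)[1]
```

Here `first_pick` plays the role of the first representative feature; the later picks additionally depend on the correlation filter.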

By default, the Spearman correlation coefficient is used to measure the correlation among candidate features. Features whose correlation exceeds a given threshold (e.g. 0.5) are dropped, leaving only the representative one with the best AUC. TSM checks correlations and drops features recursively until nothing is left to check. By default, the correlation thresholds run from 0.1 (most stringent, leaving fewer features) to 0.7 (least stringent, leaving more features) in steps of 0.1.
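The filtering described above can be sketched as a greedy loop over a grid of thresholds. This is only one reading of the description, not TSM's actual code: `greedy_select()` is a hypothetical helper, the AUC values are made up for the illustration, and comparing the absolute correlation against the threshold is an assumption.

```r
# Greedy selection: keep the best-AUC feature, drop everything correlated
# with it above the threshold, and repeat on what is left.
# greedy_select() is a hypothetical helper written for illustration only.
greedy_select <- function(features, aucs, threshold, method = "spearman") {
  cors <- abs(cor(features, method = method))        # pairwise |correlation| (assumption)
  remaining <- names(sort(aucs, decreasing = TRUE))  # candidates, best AUC first
  picked <- character(0)
  while (length(remaining) > 0) {
    best <- remaining[1]                             # highest AUC still available
    picked <- c(picked, best)
    correlated <- remaining[cors[best, remaining] > threshold]
    remaining <- setdiff(remaining, c(best, correlated))
  }
  picked
}

# Toy data (simulated): B is a near-duplicate of A, C is independent
set.seed(42)
base <- rnorm(200)
toy <- data.frame(
  A = base,
  B = base + rnorm(200, sd = 0.1),
  C = rnorm(200)
)
aucs <- c(A = 0.80, B = 0.75, C = 0.70)  # made-up AUCs for the illustration

# Sweep the default grid of thresholds, 0.1 to 0.7 in steps of 0.1
thresholds <- seq(0.1, 0.7, by = 0.1)
selected <- lapply(thresholds, function(t) greedy_select(toy, aucs, t))
names(selected) <- paste0("cor", thresholds)
```

In this toy setting B is never selected because it is almost perfectly correlated with the better-scoring A, which is the behaviour the thresholds are meant to control.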

Installation

You can install the development version of TSM like so:

devtools::install_github("sung/TSM")

Example

The input should have one column per feature, plus an outcome column (y by default) that holds a binary outcome, either 0 or 1. For example:

library(TSM)
input=read.csv(system.file("extdata","demo_input.csv",package="TSM")) # read the example input from TSM 
input[1:5,]
#>       F1     F2     F3     F4     F5     F6     F7     F8     F9    F10    F11
#> 1 3.0978 5.3539 3.4697 3.3862 6.5225 1.3523 1.4245 2.2792 0.5276 2.6810 2.4484
#> 2 3.1000 5.6593 3.7335 3.2010 6.9357 1.2460 1.6861 2.3034 0.7733 2.6492 2.4600
#> 3 3.6803 5.4233 3.8745 3.3044 5.7258 1.1525 1.3389 2.2688 0.4059 2.6794 1.6202
#> 4 3.0374 5.7566 3.4998 3.2889 6.0595 1.4913 1.5907 2.1947 0.3826 1.4088 1.3074
#> 5 3.3356 5.0057 3.1617 3.0673 5.0654 0.7485 1.1178 1.6380 0.3623 1.7394 1.2593
#>      F12    F13    F14     F15    F16    F17    F18    F19    F20    F21 y
#> 1 1.7607 1.6585 2.5810 -2.5896 8.4015 3.5893 3.2906 2.0396 1.0402 7.9531 1
#> 2 1.7252 1.3809 2.6356 -1.0027 8.3084 3.7016 3.9182 2.4075 0.7266 8.2472 1
#> 3 1.1248 1.0232 1.8736 -2.8647 8.3134 2.6569 3.7619 2.0328 0.8711 7.9282 0
#> 4 1.7147 1.2212 2.2696 -1.1208 8.0433 3.6225 3.6321 2.2341 0.9412 7.8447 1
#> 5 1.6505 0.8490 1.7405 -2.0821 7.4438 2.6039 2.0825 0.4843 0.5597 7.6459 1

TSM(x=input) # run TSM with default parameters
#> calculating AUC for each features...
#> cor0.1
#> cor0.2
#> cor0.3
#> cor0.4
#> cor0.5
#> cor0.6
#> cor0.7
#>    Cor Num features                                                 Features
#> 1: 0.4            4                                             F1,F7,F8,F14
#> 2: 0.5            7                                  F1,F7,F13,F8,F9,F14,F15
#> 3: 0.7           16 F1,F4,F6,F5,F7,F13,F8,F9,F12,F11,F14,F15,F21,F17,F18,F20
#> 4: 0.6           11                   F1,F5,F7,F13,F8,F9,F14,F15,F21,F17,F18
#> 5: 0.1            1                                                       F1
#> 6: 0.2            1                                                       F1
#> 7: 0.3            1                                                       F1
#>    Best features      AIC      BIC       AUC AUC(LPOCV)
#> 1:  F1,F7,F8,F14 106.9498 120.8872 0.8815629  0.8659951
#> 2:  F1,F7,F13,F8 108.8605 122.7980 0.8806471  0.8580586
#> 3:   F1,F4,F6,F5 111.8757 125.8131 0.8684371  0.8489011
#> 4:  F1,F5,F7,F13 111.5937 125.5312 0.8666056  0.8476801
#> 5:            F1 113.1852 118.7602 0.8446276  0.8446276
#> 6:            F1 113.1852 118.7602 0.8446276  0.8446276
#> 7:            F1 113.1852 118.7602 0.8446276  0.8446276

TSM(x=input, corr=c(0.4, 0.5)) # two correlation coefficients only 
#> calculating AUC for each features...
#> cor0.4
#> cor0.5
#>    Cor Num features                Features Best features      AIC      BIC
#> 1: 0.4            4            F1,F7,F8,F14  F1,F7,F8,F14 106.9498 120.8872
#> 2: 0.5            7 F1,F7,F13,F8,F9,F14,F15  F1,F7,F13,F8 108.8605 122.7980
#>          AUC AUC(LPOCV)
#> 1: 0.8815629  0.8659951
#> 2: 0.8806471  0.8580586

TSM(x=input, corr=c(0.4, 0.5),k=3) # two correlation coefficients and three features only 
#> calculating AUC for each features...
#> cor0.4
#> cor0.5
#>    Cor Num features                Features Best features      AIC      BIC
#> 1: 0.4            4            F1,F7,F8,F14      F1,F7,F8 107.8816 119.0316
#> 2: 0.5            7 F1,F7,F13,F8,F9,F14,F15     F1,F7,F13 109.9227 121.0727
#>          AUC AUC(LPOCV)
#> 1: 0.8751526  0.8635531
#> 2: 0.8659951  0.8528694

TSM(x=input, corr=c(0.4, 0.5),k=3,method="pearson") # pearson method  
#> calculating AUC for each features...
#> cor0.4
#> cor0.5
#>    Cor Num features       Features Best features      AIC      BIC       AUC
#> 1: 0.5            4 F1,F13,F15,F18    F1,F13,F15 107.7559 118.9058 0.8736264
#> 2: 0.4            3     F1,F11,F14    F1,F11,F14 109.9812 121.1312 0.8672161
#>    AUC(LPOCV)
#> 1:  0.8574481
#> 2:  0.8489011

As shown above, TSM returns a table (data.table) with one row per correlation threshold, sorted in descending order of the AUC from leave-pair-out cross-validation (LPOCV; Gordon, Am J Epi 2014).
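To make the AUC(LPOCV) column more concrete, here is a minimal sketch of leave-pair-out cross-validation as it is commonly defined: for every (case, control) pair, a model is fitted without that pair and then checked for whether it ranks the held-out case above the held-out control. The helper `lpocv_auc()` and the toy data are hypothetical, and TSM's own implementation may differ in its details.

```r
# Leave-pair-out cross-validated AUC for a logistic regression.
# lpocv_auc() is a hypothetical helper; TSM's own implementation may differ.
lpocv_auc <- function(data, outcome = "y") {
  pos  <- which(data[[outcome]] == 1)            # row indices of cases
  neg  <- which(data[[outcome]] == 0)            # row indices of controls
  form <- reformulate(".", response = outcome)   # outcome ~ all other columns
  score <- 0
  for (i in pos) {
    for (j in neg) {
      # fit on everything except the held-out (case, control) pair
      fit <- glm(form, family = binomial, data = data[-c(i, j), ])
      p <- unname(predict(fit, newdata = data[c(i, j), ], type = "response"))
      # credit 1 if the case is ranked above the control, 0.5 for a tie
      score <- score + (p[1] > p[2]) + 0.5 * (p[1] == p[2])
    }
  }
  score / (length(pos) * length(neg))
}

# Toy data (simulated): F1 carries signal, F2 is noise
set.seed(7)
y   <- rep(c(0, 1), each = 15)
toy <- data.frame(F1 = rnorm(30, mean = 1.5 * y), F2 = rnorm(30), y = y)
cv_auc <- lpocv_auc(toy)
```

Because one model is fitted per (case, control) pair, the cost grows with the product of the two class sizes, so the toy data are kept deliberately small.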

If you would like more detailed output, use the verbose argument:

foo<-TSM(x=input,corr=c(0.4,0.5),k=3,verbose=T)
#> calculating AUC for each features...
#> cor0.4
#> cor0.5

foo[["cor0.4"]]
#> $top.rank
#> [1] "F1"  "F7"  "F8"  "F14"
#> 
#> $cor
#>  [1] "F1"  "F2"  "F3"  "F4"  "F5"  "F6"  "F9"  "F10" "F11" "F12" "F13" "F15"
#> [13] "F16" "F17" "F18" "F19" "F21" "F7"  "F8"  "F14"
#> 
#> $num.cor
#> [1] 17  1  1  1
#> 
#> $fit
#> 
#> Call:  glm(formula = y ~ ., family = "binomial", data = my.data)
#> 
#> Coefficients:
#> (Intercept)           F1           F7           F8  
#>      15.134       -2.530       -2.577       -1.531  
#> 
#> Degrees of Freedom: 119 Total (i.e. Null);  116 Residual
#> Null Deviance:       155.4 
#> Residual Deviance: 99.88     AIC: 107.9
foo[["cor0.5"]]
#> $top.rank
#> [1] "F1"  "F7"  "F13" "F8"  "F9"  "F14" "F15"
#> 
#> $cor
#>  [1] "F1"  "F2"  "F3"  "F4"  "F5"  "F6"  "F10" "F12" "F16" "F17" "F19" "F7" 
#> [13] "F18" "F11" "F13" "F8"  "F9"  "F21" "F14" "F15"
#> 
#> $num.cor
#> [1] 11  2  2  1  2  1  1
#> 
#> $fit
#> 
#> Call:  glm(formula = y ~ ., family = "binomial", data = my.data)
#> 
#> Coefficients:
#> (Intercept)           F1           F7          F13  
#>      14.162       -2.756       -2.368       -1.268  
#> 
#> Degrees of Freedom: 119 Total (i.e. Null);  116 Residual
#> Null Deviance:       155.4 
#> Residual Deviance: 101.9     AIC: 109.9

data.table::rbindlist(foo[["performance"]])[order(-`AUC(LPOCV)`)]
#>    Cor Num features                Features Best features      AIC      BIC
#> 1: 0.4            4            F1,F7,F8,F14      F1,F7,F8 107.8816 119.0316
#> 2: 0.5            7 F1,F7,F13,F8,F9,F14,F15     F1,F7,F13 109.9227 121.0727
#>          AUC AUC(LPOCV)
#> 1: 0.8751526  0.8635531
#> 2: 0.8659951  0.8528694