topepo/caret

"Something is wrong" occurs with X and Y input but not with a formula (Y ~ .). Why?

luizalmeida93 opened this issue · 1 comment

Hi,

I've encountered the "Something is wrong; all the Accuracy metric values are missing:" error when one of my columns is a factor. However, the help page for train() does not mention any such limitation on the data structure. I think I managed to work around it, but I want to make sure the workaround does not change the analysis in any way.

Here is a toy data set:

library(caret)

set.seed(2023)
x_train <- data.frame(feat1 = rnorm(80, 0, 2),
                      feat2 = rnorm(80, 0, 2),
                      feat3 = rnorm(80, 0, 2),
                      feat4 = rnorm(80, 0, 2),
                      feat5 = rnorm(80, 0, 2),
                      Male = factor(rbinom(n = 80, size = 1, prob = 0.5)))

set.seed(123)
y_target <- factor(rbinom(n = 80, size = 1, prob = 0.5))

If I run the train function using x_train and y_target as inputs, I get the error:

elaNet_model <- caret::train(x_train, y_target, method = "glmnet")

Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :9     NA's   :9    
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)

However, I saw somewhere else that using the formula input could solve the problem, so I tested it:

test_df <- cbind(y_target, x_train)

elaNet_model <- caret::train(y_target ~ .,
                             data = test_df,
                             method = "glmnet")

Indeed, no error is raised, and I get a fully fitted model. The factor feature is still a factor:

> str(test_df)
'data.frame':	80 obs. of  7 variables:
 $ y_target: Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 2 2 1 ...
 $ feat1   : num  -0.168 -1.966 -3.75 -0.372 -1.267 ...
 $ feat2   : num  0.437 3.486 -0.239 0.972 0.283 ...
 $ feat3   : num  3.447 0.653 1.03 1.6 -1.882 ...
 $ feat4   : num  -0.687 -1.109 2.243 0.38 0.33 ...
 $ feat5   : num  2.44 -0.555 1.062 2.094 0.845 ...
 $ Male    : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 2 2 2 1 ...

If I transform the factor column into numeric, I get no error as well:

x_train2 <- x_train
x_train2$Male <- as.numeric(as.character(x_train2$Male))
elaNet_model <- caret::train(x_train2, y_target, method = "glmnet")

I am testing a bunch of models, and I see the same pattern with "xgbTree", "svmLinear", "svmRadial", and "nb".

Therefore, it seems that at some point caret changes the data structure when the formula interface is used. Is that the case? Or does it simply mean that caret can only handle factors properly through the formula interface?

I considered converting all factors to numeric before passing the data to caret, but is that appropriate? Given that most of the remaining columns are genuinely numeric, how will caret treat a numeric (formerly factor) column that contains only 0s and 1s? Won't it misinterpret those formerly-factor features?

Please, let me know if my questions sound too confusing, and thank you in advance.

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8    LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                    LC_TIME=English_Canada.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-94    lattice_0.20-45 ggplot2_3.4.0  

loaded via a namespace (and not attached):
 [1] jsonlite_1.8.4       splines_4.2.0        foreach_1.5.2        prodlim_2019.11.13   shiny_1.7.4          assertthat_0.2.1     highr_0.10          
 [8] stats4_4.2.0         globals_0.16.2       ipred_0.9-13         pillar_1.8.1         glue_1.6.2           pROC_1.18.0          digest_0.6.31       
[15] RColorBrewer_1.1-3   promises_1.2.0.1     hardhat_1.2.0        colorspace_2.0-3     recipes_1.0.4        htmltools_0.5.4      httpuv_1.6.8        
[22] Matrix_1.5-3         plyr_1.8.8           klaR_1.7-2           timeDate_4022.108    pkgconfig_2.0.3      labelled_2.11.0      listenv_0.9.0       
[29] haven_2.5.1          questionr_0.7.8      xtable_1.8-4         purrr_1.0.1          scales_1.2.1         later_1.3.0          gower_1.0.1         
[36] lava_1.7.1           timechange_0.2.0     tibble_3.1.8         proxy_0.4-27         combinat_0.0-8       generics_0.1.3       ellipsis_0.3.2      
[43] xgboost_1.7.3.1      withr_2.5.0          nnet_7.3-17          cli_3.6.0            mime_0.12            survival_3.3-1       magrittr_2.0.3      
[50] tokenizers_0.3.0     janeaustenr_1.0.0    future_1.30.0        fansi_1.0.3          parallelly_1.33.0    nlme_3.1-157         SnowballC_0.7.1     
[57] MASS_7.3-56          forcats_0.5.2        class_7.3-20         tools_4.2.0          data.table_1.14.6    hms_1.1.2            lifecycle_1.0.3     
[64] stringr_1.5.0        kernlab_0.9-32       munsell_0.5.0        glmnet_4.1-6         compiler_4.2.0       e1071_1.7-12         rlang_1.0.6         
[71] grid_4.2.0           iterators_1.0.14     rstudioapi_0.14      miniUI_0.1.1.1       gtable_0.3.1         ModelMetrics_1.2.2.2 codetools_0.2-18    
[78] DBI_1.1.3            reshape2_1.4.4       R6_2.5.1             lubridate_1.9.0      dplyr_1.0.10         fastmap_1.1.0        future.apply_1.10.0 
[85] utf8_1.2.2           tidytext_0.4.1       shape_1.4.6          stringi_1.7.12       parallel_4.2.0       Rcpp_1.0.9           vctrs_0.5.1         
[92] rpart_4.1.16         tidyselect_1.2.0

I guess I found the answer to my question. I am sharing my finding with anyone who might encounter the same problem.

It does seem that the formula interface ends up converting the factor to numeric. I concluded this from the source of the formula method for train (link). Here is the relevant part of the function:

  x <- model.matrix(Terms, m, contrasts)
  cons <- attr(x, "contrast")
  int_flag <- grepl("(Intercept)", colnames(x))
  if (any(int_flag)) x <- x[, !int_flag, drop = FALSE]
  w <- as.vector(model.weights(m))
  y <- model.response(m)

  res <- train(x, y, weights = w, ...)

After the "(Intercept)" column is removed, the "x" object contains only the predictor columns and is passed on to the x/y method of train. In short, the formula interface just adds a few preprocessing steps before falling through to the "x=" and "y=" type of input. Calling str() on "x" shows it is a matrix, so every column, including the dummy-coded factor, is numeric.
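This conversion is easy to see directly. A minimal sketch using the toy x_train above, reproducing what the formula interface does internally (the dummy column name "Male1" comes from R's default treatment contrasts):

# model.matrix() dummy-codes the factor column as 0/1 numerics
mm <- model.matrix(~ ., data = x_train)
# drop the intercept column, as train's formula method does
mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]

class(mm)                  # a matrix, no longer a data.frame
colnames(mm)               # the factor Male now appears as "Male1"
is.numeric(mm[, "Male1"])  # TRUE: the factor has become numeric 0/1

So the binary 0/1 numeric column produced this way is exactly the same encoding the formula interface would have created; converting a two-level factor to numeric by hand does not change the fitted model.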

Browsing other issues, I eventually found this explanation from the owner of the caret repository:

However, there are a variety of package functions whose models do not require that all of the predictors be encoded as numbers. Trees, rule-based models, naive Bayes, and others fall into this bucket.

So, if you want to keep factors as factors, use the non-formula method for train

I found this in issue #913.

In summary, I concluded the following:

  1. A factor in the training data only leads to errors when the chosen method cannot handle factors. This seems obvious now that I am writing it, but caret is such a complete package that I overlooked this detail.
  2. Since I am testing many methods, I will keep two versions of the training and test sets, one with factors and one all-numeric, and use whichever each method requires.
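For point 2, caret's dummyVars() can generate the all-numeric copy automatically, so the two versions stay in sync. A sketch based on the toy data above (the column name "Male.1" assumes dummyVars's default separator):

library(caret)

# Version 1: keep factors as factors (trees, naive Bayes, etc.)
x_train_fct <- x_train

# Version 2: dummy-code every factor into 0/1 numeric columns
# (for glmnet, svmLinear/svmRadial, xgbTree, and other models
#  that need an all-numeric design matrix)
dv <- dummyVars(~ ., data = x_train, fullRank = TRUE)
x_train_num <- as.data.frame(predict(dv, newdata = x_train))

str(x_train_num)  # all columns numeric; Male becomes "Male.1"

fullRank = TRUE drops one level per factor, matching what model.matrix() does in the formula interface and avoiding perfectly collinear dummy columns.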

I am keeping this open just in case any contributors want to chime in. Otherwise, feel free to close it.