find_feature_importance fails for multiclass data

Question

Closed this issue 3 years ago · 6 comments

Hi,

Thanks for developing mikropml - it's an amazing package for applying machine learning methods to microbiome data.

I have been trying to apply it to a complex multiclass dataset I am working with. However, whenever I run run_ml() with find_feature_importance = TRUE, I get the following error:

Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds

The error does not occur if I use a dataset with just two classes. I was also able to reproduce the error with the dataset otu_mini_multi that is provided as an example in the repo, so it does not seem to be specific to my data.

Any ideas on what could be the issue? I am using R v3.6.1, mikropml v1.0.0, caret v6.0-88 and future.apply v1.8.1.

Thanks in advance,
Alex

Answer 1 · 2021-08-19T20:52:55.000Z

I wasn't able to reproduce the error with R 4.0.3 and mikropml 1.1.0, nor with your versions of the software (except I had to use future.apply 1.7.0 because 1.8.1 requires R >= 4). Can you provide the code that reproduced the error using the otu_mini_multi dataset?

Here's the code I used for testing:

library(mikropml)
ml_results <- run_ml(otu_mini_multi,
  "glmnet",
  outcome_colname = "dx",
  find_feature_importance = TRUE,
  seed = 2019,
  cv_times = 2
)

And here's how I created the conda environment with your software versions:

mamba create -n R-3.6.1 r-base=3.6.1 r-caret=6.0-88 r-mikropml=1.0.0 r-future.apply

Answer 2 · 2021-08-19T21:39:24.000Z

Thanks for following this up. Interesting... if I use your exact code it works. However, if I generate the otu_mini_multi from the otu_large_multi.csv file, I get the error. See my code below:

library(mikropml)

otu_large_multi <- read.delim("otu_large_multi.csv", sep = ",")
otu_mini_multi <- otu_large_multi[, 1:11]

ml_results <- run_ml(otu_mini_multi,
                     "glmnet",
                     outcome_colname = "dx",
                     find_feature_importance = TRUE,
                     seed = 2019,
                     cv_times = 2
)

Using 'dx' as the outcome column.
Training the model...
Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds
In addition: Warning messages:
....

Both datasets seem to be identical, so I'm not sure what is going on.

Answer 3 · 2021-08-19T21:49:29.000Z

I seem to have figured it out. If I read in the file with stringsAsFactors = FALSE, it works. If I recall correctly, this is the default as of R v4, so this might have been the reason.
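
For anyone landing here on R 3.x, here is a minimal sketch of that workaround. The small CSV written to a temp file is a made-up stand-in for otu_large_multi.csv (the column names and values are assumptions for illustration only). The point is just that passing stringsAsFactors = FALSE keeps the dx column as character on any R version, and that a column already read as a factor can be converted back.

```r
# Hypothetical stand-in for otu_large_multi.csv; the values are made up.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "dx,Otu00001,Otu00002",
  "adenoma,10,3",
  "carcinoma,0,7",
  "normal,5,1"
), tmp)

# Explicitly keep strings as character vectors, regardless of R version.
dat <- read.delim(tmp, sep = ",", stringsAsFactors = FALSE)
class(dat$dx)

# If the data was already read with factors (the default before R 4.0.0),
# convert the outcome column back to character.
dat_fct <- read.delim(tmp, sep = ",", stringsAsFactors = TRUE)
dat_fct$dx <- as.character(dat_fct$dx)
class(dat_fct$dx)
```

Setting the argument explicitly, rather than relying on the default, should make the script behave the same before and after an upgrade to R 4.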

Answer 4 · 2021-08-19T21:51:16.000Z

Ahh stringsAsFactors strikes again! Glad you figured it out.

Answer 5 · 2021-08-19T21:54:48.000Z

By the way, you can also instead read in the file with readr::read_csv("otu_large_multi.csv"). It won't convert strings to factors unless you explicitly specify the col_types. https://readr.tidyverse.org/reference/read_delim.html
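
A rough sketch of what that could look like, assuming readr is installed. The temp-file CSV below is again a made-up stand-in for otu_large_multi.csv, and spelling out the type of dx via col_types is optional; it is only there to make the intent explicit.

```r
library(readr)

# Hypothetical stand-in for otu_large_multi.csv; the values are made up.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "dx,Otu00001,Otu00002",
  "adenoma,10,3",
  "carcinoma,0,7",
  "normal,5,1"
), tmp)

# State the outcome column type explicitly; other columns are guessed by readr.
dat <- read_csv(tmp, col_types = cols(dx = col_character()))
class(dat$dx)
```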

Answer 6 · 2021-08-19T22:05:45.000Z

Ah, excellent, thanks for the tip. I will also eventually have to update to R v4, but have been a bit hesitant to do so fearing it will break all my scripts.

Thanks again for the help!