SchlossLab/mikropml

find_feature_importance fails for multiclass data

Closed this issue · 6 comments

Hi,

Thanks for developing mikropml - it's an amazing package for applying machine learning methods to microbiome data.

I have been trying to apply it to a complex multiclass dataset I am working with. However, whenever I call run_ml() with find_feature_importance = TRUE, I get the following error:

Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds

The error does not occur if I use a dataset with just two classes. I was also able to reproduce the error with the dataset otu_mini_multi that is provided as an example in the repo, so it does not seem to be specific to my data.

Any ideas on what could be the issue? I am using R v3.6.1, mikropml v1.0.0, caret v6.0-88 and future.apply v1.8.1.

Thanks in advance,
Alex

I wasn't able to reproduce the error with R 4.0.3 and mikropml 1.1.0, nor with your versions of the software (except I had to use future.apply 1.7.0 because 1.8.1 requires R >= 4). Can you provide the code that reproduced the error using the otu_mini_multi dataset?

Here's the code I used for testing:

library(mikropml)
ml_results <- run_ml(otu_mini_multi,
  "glmnet",
  outcome_colname = "dx",
  find_feature_importance = TRUE,
  seed = 2019,
  cv_times = 2
)

And here's how I created the conda environment with your software versions:

mamba create -n R-3.6.1 r-base=3.6.1 r-caret=6.0-88 r-mikropml=1.0.0 r-future.apply

Thanks for following this up. Interesting... if I use your exact code it works. However, if I generate the otu_mini_multi from the otu_large_multi.csv file, I get the error. See my code below:

library(mikropml)

otu_large_multi <- read.delim("otu_large_multi.csv", sep = ",")
otu_mini_multi <- otu_large_multi[, 1:11]

ml_results <- run_ml(otu_mini_multi,
                     "glmnet",
                     outcome_colname = "dx",
                     find_feature_importance = TRUE,
                     seed = 2019,
                     cv_times = 2
)
Using 'dx' as the outcome column.
Training the model...
Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds
In addition: Warning messages:
....

Both datasets seem to be identical, so I'm not sure what is going on.
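Two data frames can print identically and still differ in column types. The sketch below shows one way to check for that, using two small made-up data frames (the names packaged and from_file and all values are hypothetical stand-ins, not the real otu_mini_multi data):

```r
# Hypothetical stand-in for a dataset whose outcome column is character
packaged <- data.frame(
  dx = c("normal", "adenoma", "carcinoma"),
  Otu00001 = c(350, 568, 151),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the same data with the outcome stored as a factor
from_file <- data.frame(
  dx = c("normal", "adenoma", "carcinoma"),
  Otu00001 = c(350, 568, 151),
  stringsAsFactors = TRUE
)

# The printed values look the same, but the objects are not identical
identical(packaged, from_file)

# Compare the class of every column and list the ones that differ
class_packaged <- sapply(packaged, class)
class_from_file <- sapply(from_file, class)
mismatched <- names(which(class_packaged != class_from_file))
mismatched
```

Running str() on each data frame gives the same information in a more verbose form.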

I seem to have figured it out. If I read in the file with stringsAsFactors = FALSE, it works. If I recall correctly, this is the default in R v4 now, so the factor conversion under R 3.6.1 might have been the cause.
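A minimal sketch of that fix, assuming a small hypothetical CSV in place of otu_large_multi.csv (the file contents below are made up for illustration). Setting stringsAsFactors explicitly, rather than relying on the default, should make a script behave the same way under R 3.x and R 4.x:

```r
# Write a tiny hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "dx,Otu00001,Otu00002",
    "normal,350,268",
    "adenoma,568,1320",
    "carcinoma,151,756"
  ),
  csv_path
)

# stringsAsFactors = TRUE mimics the default behaviour before R 4.0.0
as_factor <- read.delim(csv_path, sep = ",", stringsAsFactors = TRUE)

# stringsAsFactors = FALSE keeps the outcome column as character
as_char <- read.delim(csv_path, sep = ",", stringsAsFactors = FALSE)

class(as_factor$dx) # "factor"
class(as_char$dx)   # "character"

# A data frame that was already read with factors can be converted afterwards
fixed <- as_factor
fixed$dx <- as.character(fixed$dx)
```

With the real file, the resulting data frame would then be subset and passed to run_ml() as in the code above.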

Ahh stringsAsFactors strikes again! Glad you figured it out.

By the way, you can instead read in the file with readr::read_csv("otu_large_multi.csv"). It won't convert strings to factors unless you explicitly specify the col_types. https://readr.tidyverse.org/reference/read_delim.html
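A short sketch of the readr approach, assuming the readr package is installed and again using a small hypothetical CSV rather than the real otu_large_multi.csv:

```r
library(readr)

# Write a tiny hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "dx,Otu00001,Otu00002",
    "normal,350,268",
    "adenoma,568,1320",
    "carcinoma,151,756"
  ),
  csv_path
)

# By default, read_csv() guesses column types and leaves strings as character
dat_chr <- read_csv(csv_path)
class(dat_chr$dx)

# A factor column is only produced when requested through col_types
dat_fct <- read_csv(csv_path, col_types = cols(dx = col_factor()))
class(dat_fct$dx)
```

read_csv() returns a tibble; if a plain data frame is preferred, as.data.frame(dat_chr) converts it.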

Ah, excellent, thanks for the tip. I will also eventually have to update to R v4, but have been a bit hesitant to do so fearing it will break all my scripts.

Thanks again for the help!