find_feature_importance fails for multiclass data
Closed this issue · 6 comments
Hi,
Thanks for developing mikropml - it's an amazing package for implementing machine learning methods in microbiome data.
I have been trying to apply it to a complex multiclass dataset I am working with. However, whenever I run run_ml() with find_feature_importance = TRUE, I get the following error:
Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname, :
subscript out of bounds
The error does not occur if I use a dataset with just two classes. I was also able to reproduce the error with the dataset otu_mini_multi that is provided as an example in the repo, so it does not seem to be specific to my data.
Any ideas on what could be the issue? I am using R v3.6.1, mikropml v1.0.0, caret v6.0-88 and future.apply v1.8.1.
Thanks in advance,
Alex
I wasn't able to reproduce the error with R 4.0.3 and mikropml 1.1.0, nor with your versions of the software (except I had to use future.apply 1.7.0 because 1.8.1 requires R >= 4). Can you provide the code that reproduced the error using the otu_mini_multi dataset?
Here's the code I used for testing:
library(mikropml)
ml_results <- run_ml(otu_mini_multi,
  "glmnet",
  outcome_colname = "dx",
  find_feature_importance = TRUE,
  seed = 2019,
  cv_times = 2
)
And here's how I created the conda environment with your software versions:
mamba create -n R-3.6.1 r-base=3.6.1 r-caret=6.0-88 r-mikropml=1.0.0 r-future.apply
Thanks for following this up. Interesting... if I use your exact code it works. However, if I generate the otu_mini_multi from the otu_large_multi.csv file, I get the error. See my code below:
library(mikropml)
otu_large_multi <- read.delim("otu_large_multi.csv", sep = ",")
otu_mini_multi <- otu_large_multi[, 1:11]
ml_results <- run_ml(otu_mini_multi,
  "glmnet",
  outcome_colname = "dx",
  find_feature_importance = TRUE,
  seed = 2019,
  cv_times = 2
)
Using 'dx' as the outcome column.
Training the model...
Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname, :
subscript out of bounds
In addition: Warning messages:
....
Both datasets seem to be identical, so not sure what is going on.
I seem to have figured it out. If I read in the file with "stringsAsFactors = FALSE" it works. If I recall correctly this is the default in R v4 now, so this might have been the reason.
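For anyone hitting the same thing, here is a minimal sketch of how the column type changes depending on stringsAsFactors. The tiny table is made up for illustration (it is not the real otu_large_multi.csv), and the sketch only shows the difference in column class; it does not show what mikropml does with it internally.

```r
# Made-up stand-in for otu_large_multi.csv (values are illustrative only)
csv_text <- "dx,Otu00001,Otu00002
normal,10,3
adenoma,4,7
carcinoma,0,12"

# Setting stringsAsFactors explicitly gives the same result on R 3.x and R 4.x,
# regardless of the version-specific default
as_factor <- read.delim(text = csv_text, sep = ",", stringsAsFactors = TRUE)
as_char <- read.delim(text = csv_text, sep = ",", stringsAsFactors = FALSE)

class(as_factor$dx) # "factor"
class(as_char$dx)   # "character"

# The values look identical when printed, which is why the two
# data frames can seem to be the same at first glance
all(as.character(as_factor$dx) == as_char$dx) # TRUE
```

Comparing sapply(df, class) between the two data frames is a quick way to spot this kind of difference.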
Ahh, stringsAsFactors strikes again! Glad you figured it out.
By the way, you can also read in the file with readr::read_csv("otu_large_multi.csv") instead. It won't convert strings to factors unless you explicitly specify the col_types. https://readr.tidyverse.org/reference/read_delim.html
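A rough sketch of the readr approach, assuming the readr package is installed. The file here is a small made-up table written to a temporary path so the example runs on its own; swap in the real CSV path as needed.

```r
library(readr)

# Write a tiny made-up table to a temporary file (illustrative values only)
path <- tempfile(fileext = ".csv")
writeLines(c(
  "dx,Otu00001,Otu00002",
  "normal,10,3",
  "adenoma,4,7",
  "carcinoma,0,12"
), path)

# Default behaviour: string columns are read as character
default_read <- read_csv(path)
class(default_read$dx) # "character"

# A factor column only appears when requested explicitly through col_types
factor_read <- read_csv(path, col_types = cols(dx = col_factor()))
class(factor_read$dx) # "factor"
```

Note that read_csv returns a tibble rather than a plain data.frame; as.data.frame() converts it back if a plain data frame is preferred.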
Ah, excellent, thanks for the tip. I will also eventually have to update to R v4, but have been a bit hesitant to do so fearing it will break all my scripts.
Thanks again for the help!