Cluster Analysis failing with include_factors=TRUE
haas-christian opened this issue · 1 comments
When trying to run clustering approaches on data which includes factor variables, the function cluster_analysis() does not perform as expected.
Specifically, calling cluster_analysis(include_factors = TRUE) on data that contains factors fails with the following error message:
Error in `[.data.frame`(x, names(data)) : undefined columns selected
As far as I can tell, this happens because cluster_analysis() converts the factors into dummy variables, and line 224 of cluster_analysis.R then uses the column names of the dummy-coded data frame (data) to subset the original data frame (x), which does not contain those columns:
complete_cases <- stats::complete.cases(x[names(data)])
Changing line 224 to the following would avoid the error, because complete cases would then be determined from the dummy-coded data frame that is actually used for clustering:
complete_cases <- stats::complete.cases(data[names(data)])
or simply
complete_cases <- stats::complete.cases(data)
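The mismatch can be reproduced outside the package with base R. This is only a minimal sketch: it assumes dummy coding via stats::model.matrix() as a stand-in for whatever cluster_analysis() does internally, and the variable names x and data simply mirror the ones in line 224.

```r
# Original data with two numeric columns and one factor (Species)
x <- iris[3:5]

# Assumption: model.matrix() stands in for the package's internal dummy coding
data <- as.data.frame(stats::model.matrix(~ . - 1, data = x))

# Columns that exist only in the dummy-coded data frame
extra_cols <- setdiff(names(data), names(x))
print(extra_cols)

# Subsetting the original data frame by the dummy-coded names fails
subset_error <- tryCatch(
  x[names(data)],
  error = function(e) conditionMessage(e)
)
print(subset_error)

# Computing complete cases on the dummy-coded data frame itself works
complete_cases <- stats::complete.cases(data)
print(length(complete_cases))
```

The dummy columns derived from Species have no counterpart in x, which is why x[names(data)] stops with an error, whereas complete.cases(data) never needs to look them up in the original data frame.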
Here's a small example based on the vignette that shows the error:
library(ggplot2)
library(parameters)
library(see)
set.seed(33)
# use mixed numerical and factor variables
df <- iris[3:5]
# try clustering with include_factors = TRUE; this results in the error message above
rez_kmeans <- parameters::cluster_analysis(df, n = 3, method = "kmeans", include_factors = TRUE)
Thanks, should be fixed.