easystats/parameters

Cluster Analysis failing with include_factors=TRUE

haas-christian opened this issue · 1 comment

When clustering data that includes factor variables, cluster_analysis() fails.
Specifically, calling cluster_analysis(include_factors = TRUE) on data with factors stops with the error message

Error in `[.data.frame`(x, names(data)) : undefined columns selected

As far as I can tell, this happens because cluster_analysis() creates dummy variables and then, in line 224 of cluster_analysis.R, uses the column names of the dummy-coded data frame (data) to subset the original data frame (x). The original data frame does not contain the dummy columns, hence the error:
complete_cases <- stats::complete.cases(x[names(data)])

Changing line 224 to the following would avoid the error by computing complete cases on the dummy-coded data frame that is actually used for clustering:
complete_cases <- stats::complete.cases(data[names(data)])
or simply
complete_cases <- stats::complete.cases(data)
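
The mismatch can be reproduced without the package. The following is only a sketch: it assumes the dummy coding behaves roughly like stats::model.matrix(), which is not necessarily what cluster_analysis() does internally, and the variable names x and data simply mirror those in the line quoted above.

```r
# Original data with a factor column (stands in for `x`).
x <- iris[3:5]

# Dummy-coded copy (stands in for `data`). Using model.matrix() is an
# assumption for illustration, not the package's actual implementation.
data <- as.data.frame(stats::model.matrix(~ . - 1, data = x))

# Columns that exist only in the dummy-coded frame.
missing_cols <- setdiff(names(data), names(x))
print(missing_cols)

# Subsetting the original frame by the dummy names raises an error,
# which is captured here as a message instead of stopping the script.
err <- tryCatch(
  stats::complete.cases(x[names(data)]),
  error = function(e) conditionMessage(e)
)
print(err)

# Computing complete cases on the dummy-coded frame itself works.
complete_cases <- stats::complete.cases(data)
print(sum(complete_cases))
```

Because the Species column is replaced by one column per level in the dummy-coded frame, none of those new names can be found in x, so any lookup by name has to happen on data.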

Here's a small example based on the vignette that shows the error:

library(ggplot2)
library(parameters)
library(see)

set.seed(33)

# use mixed numerical and factor variables
df <- iris[3:5]

# clustering with include_factors = TRUE fails with the error message above
rez_kmeans <- parameters::cluster_analysis(df, n = 3, method = "kmeans", include_factors = TRUE)
 

Thanks, should be fixed.