mingzehuang/latentcor

Why does the user need to provide the types for each column?


It seems that based on the example in the README using mtcars, the software should be easily able to provide "guesses" of the type for each variable, and spit them back to the user so that they see them, and can re-run with a specific list if it's not right.

Alternatively, there could be a helper function to generate the list of column types that is run first, modified if needed, and then passed to the estR (latentcor) function.

Thank you for the suggestion. We created a new function, get_types, that automatically determines the type of each variable and returns a vector of types compatible with the expected input to latentcor. In addition, the default for the types parameter in latentcor has been changed to NULL, so if the types are not supplied, latentcor automatically runs get_types first. However, we do recommend that users supply the types explicitly if they are known in advance, as automatic determination via get_types increases computational cost. For mtcars the increase is small, as the dataset is small (32 samples, 11 variables):

library(latentcor)       # provides get_types and latentcor
library(microbenchmark)  # timing utility
microbenchmark(get_types(mtcars))
# median 497 microseconds on Mac OS with 3.1 GHz Dual-Core Intel Core i7

However, when the number of variables is large, the increase is more substantial

X = matrix(rnorm(500 * 1000), 500, 1000)
microbenchmark(get_types(X))
# median 43 milliseconds on Mac OS with 3.1 GHz Dual-Core Intel Core i7

and will be even more substantial if latentcor is run as part of sub-sampling or bootstrapping routines without specifying the types explicitly. We reflect this recommendation in the latentcor documentation for the types argument and in the updated vignette, which shows get_types applied to the mtcars dataset.
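As a rough sketch of the recommended pattern, the types can be inferred once with get_types, checked by eye, and then passed explicitly to every latentcor call inside a sub-sampling loop. The simulated data, the number of subsamples (n_rep) and the subsample size are arbitrary choices made for illustration and are not taken from the package documentation.

```r
library(latentcor)

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)  # simulated continuous data, for illustration only

# Infer the types once; inspect them and edit entries by hand if any guess looks wrong
types <- get_types(X)
print(types)

# Reuse the same types vector in every subsample so get_types is not re-run each time
n_rep <- 10  # number of subsamples, chosen arbitrarily for this sketch
R_list <- lapply(seq_len(n_rep), function(b) {
  idx <- sample(nrow(X), size = 80)  # subsample rows without replacement
  latentcor(X[idx, ], types = types)$R  # keep only the estimated correlation matrix
})
```

Passing types explicitly here means the type detection cost is paid once rather than once per subsample.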

Exactly what I was imagining. Looks awesome.