This repository contains all R code used in running and analyzing the simulation study and case study reported in the manuscript.
As the simulation study involves many simulation settings (over 27,000) and some methods take a long time to estimate, a custom parallel simulation framework was implemented for use on a computational cluster. A cluster is not strictly needed if you only want to replicate a subset of the simulation scenarios or methods, but you do need to configure a Redis database server (https://redis.io/) to run any simulations. Instructions are provided below.
The complete database of simulation results (600 MB) is available upon request.
- MixTVEM source code used in the simulation study - https://github.com/dziakj1/MixTVEM
- lcmm R package, used for estimating GMM and GBTM - https://cran.r-project.org/package=lcmm
- kml R package, used for estimating KmL - https://cran.r-project.org/package=kml
- latrend R package: the longitudinal clustering framework we created, building on the lessons learned from this work - https://github.com/philips-software/latrend
- Either load the RStudio project file comparison.Rproj, or start an R session with the working directory set to the root repository directory.
- Install required packages and dependencies
install.packages(c("assertthat", "data.table", "effects", "ggplot2", "igraph", "kml", "latex2exp", "lcmm", "lpSolve", "memoise", "mvnfast", "magrittr", "multcompView", "nlme", "polynom", "R.utils", "rredis", "scales", "weights"), dependencies = TRUE)
- Create an .Rprofile file with the following content:
FIG_DIR <- 'figs' # directory to export figures to
TAB_DIR <- 'tabs' # directory to export model coefficient tables to
OSU_USAGE_DATA_FILE <- '../data/<rds file name>'
CASE_OSU_RESULTS_DIR <- '../caseresults' # directory where to store the models
REDIS_HOST_FILE <- 'redis/localhost.txt' # file specifying hostname and port
REDIS_PWD <- 'password' # server AUTH password
source('include.R')
Change file and directory paths as needed.
- Restart the R session. This should now automatically run the .Rprofile file, which you can tell by the output in the console on start-up. The include.R script loads all required packages and functions.
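As a quick sanity check after the restart, the following sketch lists which of the settings from the .Rprofile above are not defined in the session. The helper name missing_settings is hypothetical and not part of the repository.

```r
# Names of the settings defined in the .Rprofile shown above
required_settings <- c(
  "FIG_DIR", "TAB_DIR", "OSU_USAGE_DATA_FILE",
  "CASE_OSU_RESULTS_DIR", "REDIS_HOST_FILE", "REDIS_PWD"
)

# Hypothetical helper (not part of the repository): returns the names
# that are not defined in the given environment
missing_settings <- function(var_names, envir = globalenv()) {
  is_defined <- vapply(var_names, exists, logical(1),
                       envir = envir, inherits = FALSE)
  var_names[!is_defined]
}

# An empty character vector means that all settings are available
missing_settings(required_settings)
```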
You should now be able to run all functions and scripts. Running simulation studies requires a Redis database server to be configured.
The Redis database stores the open jobs as well as the results of completed jobs. Parallel workers fetch jobs from the Redis queue and store the results in the respective experiment set. Storing results in the database avoids the rather large file system overhead of saving thousands of small result files.
- Download the Redis binaries. Older binaries are available at https://github.com/microsoftarchive/redis/
- Install Redis
- Make sure Redis is added to your system's PATH environment variable.
- Let Redis use the default port (6379).
WIP
- Set BASEDIR in redis.ksh
You need to start the Redis server before you can run simulations or retrieve simulation results.
The Redis configuration file included in the repository (redis/redis.conf) configures a server on port 6379 with password "password" and saves the database to redis/database.rdb. A server password is required because the simulation R code connects to Redis using authentication.
To start the Redis server on Windows, run redis.bat. Alternatively, you can open the command line in the root repository directory and execute redis-server redis/redis.conf.
If everything is configured correctly, a window showing the Redis server output should appear.
If no window shows up, that indicates the Redis server failed to start. First check if the database directory path exists.
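The directory check can also be done from R. This is a minimal sketch, assuming the database location redis/database.rdb from the included configuration file and the repository root as working directory. The helper name db_dir_exists is hypothetical.

```r
# Hypothetical helper: TRUE if the directory that should hold the
# Redis database file exists
db_dir_exists <- function(db_file = "redis/database.rdb") {
  dir.exists(dirname(db_file))
}

if (!db_dir_exists()) {
  message("Directory 'redis' not found; ",
          "is the working directory the repository root?")
}
```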
From the root directory of the repository, run
redis-server redis/redis.conf
After you have confirmed that the Redis server is running and you have opened an R session with all scripts loaded, connect to Redis in R by running redis_connect(). You should see the message "Connected to Redis at localhost:6379.".
All simulation scenarios described in the manuscript are located inside the experiments folder. Simulation scenarios are defined in R scripts prefixed by exp_.
As an example, the simulation settings for the scenario involving a known number of clusters are defined and generated in exp_normal_known.R.
Specifically, the settings for the scenario with a two-cluster dataset with quadratic trends and a varying number of trajectories, observations, random effects, and noise are generated using:
cases_normal2 <- expand.grid(
data = c('longdata_randquad2'),
model = c('longmodel_kml', 'longmodel_gcm2km', 'longmodel_gbtm2', 'longmodel_gmm2', 'longmodel_mixtvem_nugget'),
numtraj = c(200, 500, 1000),
numobs = c(4, 10, 25),
numclus = 2,
re = c(RE_NORM_LOW, RE_NORM_MED, RE_NORM_HIGH),
noise = c(.01, .1),
dataseed = 1:100,
seed = 1
) %>%
as.data.table() %T>%
print()
The model names passed through the model argument are names of the functions defined in the methods folder. This makes it relatively easy to define and evaluate new methods.
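To illustrate how a method name stored in the table can be resolved to a function, here is a minimal sketch. The function longmodel_example, its arguments, and its output fields are hypothetical; the actual signature of the functions in the methods folder may differ.

```r
# Hypothetical method: clusters trajectories by their mean response
# and returns a named list of scalar values
longmodel_example <- function(data, numclus) {
  traj_means <- as.numeric(tapply(data$value, data$id, mean))
  clusters <- kmeans(traj_means, centers = numclus)
  list(numclus = numclus, wss = clusters$tot.withinss)
}

# Toy dataset: 20 trajectories with 5 observations each
set.seed(1)
toy_data <- data.frame(
  id = rep(1:20, each = 5),
  value = rnorm(100, mean = rep(c(0, 3), each = 50))
)

# Resolve the method by its name, as it would appear in the model column
method <- match.fun("longmodel_example")
result <- method(toy_data, numclus = 2)
str(result)
```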
Providing dataseed = 1:100 results in 100 different datasets being generated.
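As a quick check of the size of this design, the grid can be rebuilt in base R and its rows counted. The character values used for re are placeholders for the RE_NORM_* constants, which are defined elsewhere in the repository; the count assumes that each constant is a single value.

```r
cases <- expand.grid(
  data = "longdata_randquad2",
  model = c("longmodel_kml", "longmodel_gcm2km", "longmodel_gbtm2",
            "longmodel_gmm2", "longmodel_mixtvem_nugget"),
  numtraj = c(200, 500, 1000),
  numobs = c(4, 10, 25),
  numclus = 2,
  re = c("low", "med", "high"),  # placeholders for RE_NORM_LOW/MED/HIGH
  noise = c(.01, .1),
  dataseed = 1:100,
  seed = 1,
  stringsAsFactors = FALSE
)

# 5 models x 3 numtraj x 3 numobs x 3 re x 2 noise x 100 data seeds
nrow(cases)  # 27000
```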
After generating the table of simulation settings, we can submit them to the job queue using the experiment_submit() function. Only jobs which have not been previously evaluated are added.
redis_connect() # connect to Redis first
experiment_submit(name = 'normal_known', cases = cases_normal2)
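The idea of adding only new jobs can be sketched in base R as an anti-join between the requested cases and those already evaluated. This is an illustration only, not the implementation of experiment_submit(); the helper new_cases is hypothetical.

```r
# Hypothetical helper: keep the rows of 'cases' whose settings do not
# occur in 'done' (both data frames with the same columns)
new_cases <- function(cases, done) {
  # Build one string key per row from all setting columns
  key <- function(df) {
    do.call(paste, c(unname(as.list(df[names(cases)])), sep = "|"))
  }
  cases[!key(cases) %in% key(done), , drop = FALSE]
}

cases <- expand.grid(model = c("a", "b"), dataseed = 1:3,
                     stringsAsFactors = FALSE)
done <- cases[cases$dataseed == 1, ]  # pretend these were evaluated before

todo <- new_cases(cases, done)
nrow(todo)  # 4
```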
The submitted jobs now need to be evaluated. This evaluation is done by worker instances.
To start a simulation worker on Windows, run worker.bat. However, for this to work, R needs to be in your PATH environment variable so Windows can locate the R executable file.
On Linux, in the command line from the repository directory, run
R --slave -f redis/worker.R
On computational clusters, you can start worker batch jobs in a similar manner.
You can start as many workers as your system allows. The workers will pull jobs from the queue and evaluate them. When no more jobs are open, the workers will terminate.
You can also evaluate jobs in the master R session by sourcing the redis/worker.R script.
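The worker behaviour described above can be sketched with an in-memory stand-in for the Redis queue. Everything here (the store environment, run_job(), run_worker()) is hypothetical and only mirrors the pattern of pulling jobs until the queue is empty; the actual redis/worker.R script talks to the Redis server instead.

```r
# In-memory stand-ins for the Redis job queue and result store
store <- new.env()
store$queue <- list(
  list(experiment = "demo", model = "a", dataseed = 1),
  list(experiment = "demo", model = "a", dataseed = 2),
  list(experiment = "demo", model = "b", dataseed = 1)
)
store$results <- list()

# Hypothetical job evaluation returning a named list of scalars
run_job <- function(job) {
  set.seed(job$dataseed)
  list(model = job$model, dataseed = job$dataseed, score = runif(1))
}

# Pull jobs until the queue is empty, then terminate
run_worker <- function(store) {
  while (length(store$queue) > 0) {
    job <- store$queue[[1]]
    store$queue <- store$queue[-1]
    out <- run_job(job)
    # Append the result to the set of its experiment
    store$results[[job$experiment]] <-
      c(store$results[[job$experiment]], list(out))
  }
  invisible(length(store$results))
}

run_worker(store)
length(store$results$demo)  # 3
```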
Jobs
job_monitor() # monitor number of remaining jobs over time
job_count() # returns number of open jobs
job_clear() # clear the job queue
Experiments
experiment_names() # get list of evaluated experiments
experiment_delete(name) # delete all results of the respective experiment
Simulation results can be retrieved and analyzed at any time, returning all job results that have been completed up to that moment. All simulation scenario analysis scripts are located inside the experiments folder, prefixed by analysis_.
Methods output their results as a named list of scalar values. Results can therefore be easily combined into a table. All evaluated cases can be retrieved as a single data.table object using the experiment_getOutputTable() function.
results_normal_all <- experiment_getOutputTable('normal_known')
head(results_normal_all)
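Because each job returns a named list of scalars, combining results into a table amounts to a row-bind. The following base R sketch uses made-up result fields; the actual columns depend on the method and experiment.

```r
# Made-up job results, each a named list of scalar values
job_results <- list(
  list(model = "longmodel_kml", numtraj = 200, score = 0.80),
  list(model = "longmodel_kml", numtraj = 500, score = 0.90),
  list(model = "longmodel_gmm2", numtraj = 200, score = 0.70)
)

# One row per job
results <- do.call(rbind, lapply(job_results, as.data.frame))

# Mean of the (hypothetical) score column per model
score_by_model <- aggregate(score ~ model, data = results, FUN = mean)
score_by_model
```

With data.table, rbindlist() serves the same purpose as the do.call(rbind, ...) step.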