Optimizing Precision and Power by Covariate Selection in Randomized Trials with an Application to COVID-19

Nicholas Williams, Michael Rosenblum, Iván Díaz

This repository contains the R code for the simulation studies in the paper “Optimizing Precision and Power by Covariate Selection in Randomized Trials with an Application to COVID-19.”

Directory guide

The R/ directory contains the source code for functions used in the simulations. The scripts/ directory contains R and bash scripts for cleaning the raw data, calculating true values, running the simulations on a computing cluster, and turning simulation results into LaTeX tables.

Simulation data

The COVID-19 data used in the paper will not be shared. However, synthetic versions of the data have been made available in the data/public directory.

adjrct

The estimators described in the paper are implemented with a special version of the adjrct package. The source code for the package is included with this repository as a .tar.gz file. The package can be installed with install.packages("adjrct_0.1.0.9000.tar.gz", type = "source", repos = NULL). adjrct requires a dependency not currently on CRAN, this package can be installed with remotes::install_github("nt-williams/simul").

Running simulations

Simulations are intended to be run on a high performance computing cluster using Slurm. scripts/simulate.sh establishes the computing parameters (this file contains code specific to the authors machines and will need to be modified accordingly by other users) and runs scripts/simulate.R. scripts/simulate.R uses configuration parameters defined in scripts/config.yml to execute different simulation scenarios.

Tutorial

Loading data

box::use(dgm = ./R/data)

synthetic <- dgm$covid("ordinal", private = FALSE)

Generating RCT data

tmp <- dgm$generate_data(synthetic, "ordinal", prognostic = TRUE, seed = 123, n = 500, effect_size = 3)
head(tmp)

##         age    sex   bmi         smoke  o2 num_comorbid num_symptoms bilat
## 1: 54.93498   Male 25.70            No Yes            0            8   Yes
## 2: 87.91513   Male 29.63 Former Smoker Yes            3            4   Yes
## 3: 79.86037 Female 29.12            No  No            4            1   Yes
## 4: 35.35661   Male 29.50            No Yes            0            4   Yes
## 5: 59.33744   Male 24.94            No  No            1            8    No
## 6: 61.44011 Female 30.40            No Yes            1            2    No
##    dyspnea hyper            time state_ordinal A
## 1:       1     0  2.8812500 days             0 0
## 2:       0     1  0.1166667 days             4 0
## 3:       0     1  3.9277778 days             0 1
## 4:       1     0 13.7076389 days             0 0
## 5:       1     0  4.7506944 days             0 0
## 6:       1     0 14.7111111 days             1 0

Simulating a single RCT

The simulate() function runs a single iteration of a scenario. In the following example, we run an instance for an ordinal outcome in an RCT with 500 observations where we adjust for age, sex, dyspnea, and supplemental o2 use. These covariates are prognostic of the outcome; variable selection is performed using ℓ1 regularization.

box::use(sim = ./R/simulate)

sim$simulate(synthetic, type = "ordinal", covar = c("age", "sex", "dyspnea", "o2"), 
             prognostic = TRUE, seed = 123, n = 500, effect_size = 3, algo = "lasso", 
             crossfit = FALSE)

## $log_or
## [1] 0.667766
## 
## $log_or.std.error
## [1] 0.2181065
## 
## $mannwhit
## [1] 0.4563578
## 
## $mannwhit.std.error
## [1] 0.02082584

The partition() function takes a data frame of simulation parameters where each row corresponds to a single iteration and dispatches these scenarios to the simulate() function.

Running a full simulation

This section assumes that the scripts/simulate.sh and scripts/simulate.R scripts have been modified for the users unique computing setup.

To run a full simulation, in a terminal navigate to the scripts/ directory. A simulation can then be started by calling sbatch simulate.sh {config name} where config_name specifies the configuration for the specific simulation. All configurations for the paper are defined in the scripts/config.yml file.

For example, to run the full simulation for the unadjusted estimator with a time-to-event outcome run sbatch simulate.sh sunad. Simulation results will be saved as a collection of files in the data/res directory. Results can then be extracted into a single file (in data/) by running in the terminal Rscript results.R {config_name}. LaTeX tables of extracted results are then created with the scripts/summarize.R script with a table template defined in scripts/main.brew.

The steps for fully replicating the paper are as follows:

On a computing cluster with Slurm, navigate to the scripts/ directory
Run sbatch simulate.sh {insert configuration name}
Run Rscript results.R {insert configuration name}
Repeat steps 2 and 3 for all configurations defined in scripts/config.yml
Run Rscript summarize.R

nt-williams/covid-RCT-covar