Synthetic datasets: A non-technical primer for the biobehavioural sciences to promote reproducibility and hypothesis-generation
Synthetic datasets are an emerging method originally developed to permit the sharing of confidential census data. Synthetic datasets mimic real datasets by preserving their statistical properties and the relationships between variables. Importantly, this method also reduces disclosure risk to essentially nil as no record in the synthetic dataset represents a real individual. This is the accompanying R script for my primer manuscript, which enables scholars to create synthetic datasets and assess their utility via the synthpop
R package. By sharing synthetic datasets that mimic original datasets that could not otherwise be made open, researchers can ensure the reproducibility of their results and facilitate data exploration while maintaining participant privacy.
To launch a RStudio server instance and run my analysis scripts online, click here or on the "Launch Binder" badge below.
Once the Rstudio server instance has loaded, run the commands in the "R_script.R" file.
Due to resource constraints of the RStudio server instance, the scripts that create Supplementary Figures 1-3 described in the primer manuscript could not be included. These scripts can be found on the manuscript's Open Science Framework page.
To run the analysis locally in RStudio, download this repository as a zipped file. The R version and package versions are noted in the sessionInfo.txt
file.