*****************************************************************
*-- Data and code to reproduce figures and results in paper   --*
*--                                                           --*
*-- 'Accurate error control in high dimensional association   --*
*--  testing using conditional false discovery rates'         --*
*--                                                           --*
*--                                                           --*
*-- James Liley and Chris Wallace, 2020                       --*
*-- Correspondence: JL, james.liley@igmm.ed.ac.uk             --*
*****************************************************************

This folder contains all relevant material to reproduce the plots and
results in the paper above.

We assume that the associated R package 'cfdr' is installed and loaded.
If it is not, use

  library(devtools)
  install_github("jamesliley/cfdr")
  library(cfdr)

As a failsafe, the directory code/ contains a reproduction of the
package in code/functions.R. This can be sourced instead.

For some parts (mostly the large-scale simulations), regenerating all
data used in the paper takes a prohibitively long time. For this
reason, we include a script which reproduces a single run of the
simulation, and several matrices of results from previous runs. Before
executing a run of the simulation, create a folder called 'simulations'
(lowercase) in the same directory as this README, if it does not
already exist.

In all processes involving random number generation, we set a random
seed explicitly. For this reason, all results should match those in the
text exactly.

In this guide, we outline the subdirectories in this folder, then run
through what each script does and what each data object contains. An
additional README in ./data explains the columns of the matrices of
simulation results.

All code should be run with the folder containing this README as the
working directory. All file paths are relative to it.

The bottom of this README lists R and package versions, which should be
matched exactly for full reproduction.

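The setup steps above can be sketched in R as follows. This is a minimal
illustration and not part of the pipeline: prepare_session() is a
hypothetical helper name, and the sketch assumes it is given the folder
containing this README as its root.

```r
# Hypothetical helper (not part of the pipeline): prepares an R session
# for the scripts described in this README.
prepare_session <- function(root = getwd()) {
  # Create the folder that simulation runs write to, if it is missing.
  sim_dir <- file.path(root, "simulations")
  if (!dir.exists(sim_dir)) dir.create(sim_dir)

  # Prefer the installed cfdr package; otherwise fall back to the copy
  # of its functions bundled in code/functions.R.
  fallback <- file.path(root, "code", "functions.R")
  if (requireNamespace("cfdr", quietly = TRUE)) {
    library(cfdr)
    source_used <- "package"
  } else if (file.exists(fallback)) {
    source(fallback)  # defines the functions in the global environment
    source_used <- "code/functions.R"
  } else {
    stop("Neither the cfdr package nor code/functions.R was found")
  }
  invisible(source_used)
}
```

Calling prepare_session() from the folder containing this README creates
'simulations' if needed and returns (invisibly) which copy of the cfdr
code was used.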
*****************************************************************
*-- Directories                                               --*
*****************************************************************

This folder contains four subdirectories:

code: contains R scripts

outputs: contains figures included in the manuscript (as PDFs) and
  tables of results

simulations: directory to which simulation output is written (empty)

data: contains raw datasets used for analysis and matrices of
  previously run simulations

*****************************************************************
*-- Scripts                                                   --*
*****************************************************************

Folder 'code' contains seven files:

code/run_simulation.R: runs a single iteration of the simulation, given
  a random seed and other parameters.

code/submit_codes.txt: details the scripts used to generate each class
  of simulation results, and gives instructions for reproducing them.

code/simulation_analysis.R: given matrices of simulation results and
  GWAS data, generates tables and draws plots as PDFs to the 'outputs'
  directory.

code/twas_analysis.R: runs the analysis of transcriptome-wide
  association study (TWAS) data in the motivating example in the paper.

code/functions.R: a reproduction of all necessary code in the package
  cfdr.

code/reproducibility_check.R: checks R and package versions, and for
  each simulation matrix (except sim_parametric_adjustment) chooses a
  random row and reproduces it.

code/run_sim_paradj.R: a shortened version of run_simulation.R which
  uses a parametric adjustment for parametric cFDR. Used for
  reproducibility only.

*****************************************************************
*-- Data objects                                              --*
*****************************************************************

Folder 'data' contains 15 objects. Nine are simulation matrices, two
are .RData files generated by code/simulation_analysis.R, and three (in
subfolder TWAS) are objects relating to the TWAS analysis. The final
object is a README describing the columns of the simulation matrices.

data/sim_gen_high_fdr.txt: matrix of simulation results for general
  circumstances, with parameters randomly chosen from the distribution
  specified in the manuscript, controlling FDR at 0.1.

data/sim_gen_high_fdr_null.txt: matrix of simulation results for
  general circumstances, with parameters randomly chosen from the
  distribution specified in the manuscript conditional on
  n1p + n1pq = 0 (that is, no true associations), controlling FDR at
  0.1.

data/sim_gen_low_fdr.txt: matrix of simulation results for general
  circumstances, with parameters randomly chosen from the distribution
  specified in the manuscript, controlling FDR at 0.01.

data/sim_gen_low_fdr_null.txt: matrix of simulation results for general
  circumstances, with parameters randomly chosen from the distribution
  specified in the manuscript conditional on n1p + n1pq = 0 (that is,
  no true associations), controlling FDR at 0.01.

data/sim_fixed.txt: matrix of simulation results in which parameters
  are chosen from one of several fixed parameter sets.

data/sim_cov.txt: matrix of simulation results with parameters selected
  randomly as for sim_gen_high_fdr.txt, but with dependent observations
  according to either a block diagonal or an equicorrelated covariance
  matrix.

data/sim_cov_null.txt: matrix of simulation results with parameters
  selected randomly as for sim_gen_high_fdr_null.txt (that is, with no
  true associations), and with dependent observations according to
  either a block diagonal or an equicorrelated covariance matrix.

data/sim_unrelated.txt: matrix of simulation results with parameters
  chosen as for sim_gen_high_fdr.txt, but conditioning on n1pq = 0
  (that is, no shared associations).

data/sim_parametric_adjustment.txt: matrix of simulation results using
  the parametrised version of cFDR only, with an 'adjustment' based on
  the parametrisation rather than the empirical CDF.

data/iterated_cfdr_data.RData: data used in assessing iterated cFDR.
  This is deterministically generated by a block of code in
  code/simulation_analysis.R, but because it takes several hours to
  generate, it is saved and restored rather than regenerated.

data/convergence_data.RData: data used for drawing the figure showing
  convergence of various L-regions. This is deterministically generated
  by a block of code in code/simulation_analysis.R, but because it
  takes several hours to generate, it is saved and restored rather than
  regenerated.

data/TWAS/raw/BCAC.dat: raw data for breast cancer GWAS, downloaded
  from twashub.org.

data/TWAS/raw/OCAC.dat: raw data for ovarian cancer GWAS, downloaded
  from twashub.org.

data/TWAS/twas_summary.RData: processed TWAS data, generated
  deterministically by code/twas_analysis.R. It takes several hours to
  generate, so it is saved and restored rather than regenerated.

README.txt: an explanation of the columns of each of the simulation
  matrices.

*****************************************************************
*-- R and package versions                                    --*
*****************************************************************

R version 3.3.3
mnormt version 1.5.5
mgcv version 1.8.17
pbivnorm version 0.6.0
MASS version 7.3.45
fields version 8.10
matrixStats version 0.51.0
latex2exp version 0.4.0
maps version 3.1.1
spam version 1.4.0
grid version 3.3.3
nlme version 3.1.131.1

Output of sessionInfo() on reproduction:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux 7.8 (Nitrogen)

locale:
 [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
 [8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
 [1] latex2exp_0.4.0 fields_8.10 maps_3.1.1 spam_1.4-0 MASS_7.3-45 pbivnorm_0.6.0 mgcv_1.8-17 nlme_3.1-131.1 mnormt_1.5-5 matrixStats_0.51.0

loaded via a namespace (and not attached):
[1] magrittr_1.5 Matrix_1.2-8 tools_3.3.3 stringi_1.1.6 stringr_1.2.0 lattice_0.20-34
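
The versions listed above can be compared against a local installation
with a short check along the following lines. This is a sketch only and
is not the code in code/reproducibility_check.R: the function name
check_versions() and the layout of its result are assumptions made for
illustration. Versions are compared as version objects, so '1.5.5' and
'1.5-5' are treated as equal.

```r
# Package versions as listed in this README (grid ships with R itself,
# so it is covered by the check on the R version).
expected <- c(mnormt = "1.5.5", mgcv = "1.8.17", pbivnorm = "0.6.0",
              MASS = "7.3.45", fields = "8.10", matrixStats = "0.51.0",
              latex2exp = "0.4.0", maps = "3.1.1", spam = "1.4.0",
              nlme = "3.1.131.1")

# Hypothetical helper: one row per package, with the installed version
# (NA if the package is not installed) and whether it matches.
check_versions <- function(expected) {
  installed <- vapply(names(expected), function(pkg) {
    tryCatch(as.character(packageVersion(pkg)),
             error = function(e) NA_character_)
  }, character(1))
  matches <- vapply(seq_along(expected), function(i) {
    !is.na(installed[[i]]) &&
      package_version(installed[[i]]) == package_version(expected[[i]])
  }, logical(1))
  data.frame(package = names(expected), expected = unname(expected),
             installed = unname(installed), matches = matches,
             stringsAsFactors = FALSE)
}

# TRUE only if the running R matches the version listed above.
r_matches <- getRversion() == "3.3.3"
```

Running check_versions(expected) returns a data frame whose 'matches'
column flags any package that is missing or differs from the listed
version.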