cfdr_pipeline: An R repository from jamesliley

*****************************************************************
*--  Data and code to reproduce figures and results in paper  --*
*--                                                           --*
*--  'Accurate error control in high dimensional association  --*
*--  testing using conditional false discovery rates'         --*
*--                                                           --*
*--                                                           --*
*--  James Liley and Chris Wallace, 2020                      --*
*--  Correspondence: JL, james.liley@igmm.ed.ac.uk            --*
*****************************************************************



This folder contains all relevant material to reproduce plots 
and results in the paper above. We assume that the associated R
package 'cfdr' is loaded. If it is not installed, use

library(devtools)
install_github("jamesliley/cfdr")

As a failsafe, the directory code/ contains a reproduction of 
the package in code/functions.R. This can be sourced instead.
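
As a minimal sketch of this fallback, the helper below attaches 
the package if it is installed and sources code/functions.R 
otherwise. The name load_cfdr is hypothetical and is not part of 
this repository.

```r
# Hypothetical helper (not part of this repository): attach the
# cfdr package if installed, otherwise source the bundled copy.
load_cfdr <- function(fallback = file.path("code", "functions.R")) {
  if (requireNamespace("cfdr", quietly = TRUE)) {
    library(cfdr)
    return(invisible("package"))
  }
  if (!file.exists(fallback)) {
    stop("cfdr is not installed and ", fallback, " was not found")
  }
  source(fallback)  # evaluated in the global environment
  invisible("source")
}

# Usage, from the folder containing this README:
# load_cfdr()
```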

In some areas (mostly large-scale simulations), regenerating all 
data used in the paper takes a prohibitively long time. For this 
reason, we include a script which reproduces a single run of the 
simulation, and several matrices of results from previous runs.

Prior to executing a run of the simulation, a folder called 
'simulations' (lowercase) should be created in the same directory 
as this README, if it does not already exist.
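
One way to do this from within R is sketched below. The helper 
name ensure_sim_dir is hypothetical; it assumes the working 
directory is the folder containing this README.

```r
# Hypothetical helper: create the folder for simulation output
# if it does not exist yet, and return its path invisibly.
ensure_sim_dir <- function(path = "simulations") {
  if (!dir.exists(path)) {
    dir.create(path)
  }
  invisible(path)
}

# Usage, from the folder containing this README:
# ensure_sim_dir()
```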

In all processes involving random number generation, we set a 
random seed explicitly. For this reason, all results should match
those in the text exactly.
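
The toy example below illustrates why a fixed seed gives 
identical results. The function toy_run is a hypothetical 
stand-in and is unrelated to code/run_simulation.R.

```r
# Hypothetical stand-in for one simulation run: the seed fully
# determines the random draws, so repeated calls agree exactly.
toy_run <- function(seed, n = 5) {
  set.seed(seed)
  rnorm(n)
}

first  <- toy_run(seed = 1)
second <- toy_run(seed = 1)
identical(first, second)  # TRUE: same seed, same draws
```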

In this guide, we will outline the subdirectories in this folder,
then run through what each script does, and what each data object
contains.

An additional README in ./data explains columns of the matrix of 
simulation results.

All code should be run with the folder containing this README as
the working directory; all file paths are relative to it. The 
bottom of this README indicates R and package versions, which 
should be matched exactly for full reproduction.
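
A quick check that the working directory is set correctly might 
look like the sketch below. The helper has_expected_layout is 
hypothetical; it only tests for the subdirectories that ship 
with this folder.

```r
# Hypothetical helper: TRUE if 'dir' contains the subdirectories
# this README describes as shipped with the repository.
has_expected_layout <- function(dir = getwd()) {
  expected <- c("code", "data", "outputs")
  all(dir.exists(file.path(dir, expected)))
}

# Usage: has_expected_layout() should return TRUE before any
# script is run.
```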


*****************************************************************
*--  Directories                                              --*
*****************************************************************

This folder contains four subdirectories:
 code: contains R scripts
 outputs: contains figures included in the manuscript (as PDFs) 
   and tables of results
 simulations: directory to which simulation output is written 
   (empty)
 data: contains raw datasets used for analysis and matrices of 
   already-run simulations


*****************************************************************
*--  Scripts                                                  --*
*****************************************************************

Folder 'code' contains seven files:
code/run_simulation.R: runs a single iteration of the simulation,
  given a random seed and other parameters
code/submit_codes.txt: details the scripts used to generate each 
  class of simulation results, and gives instructions for 
  reproducing them
code/simulation_analysis.R: given matrices of simulation results 
  and GWAS data, generates tables and draws plots as PDFs in the 
  'outputs' directory.
code/twas_analysis.R: runs the analysis of transcriptome-wide
  association study data in the motivating example in the paper.
code/functions.R: a reproduction of all necessary code in the 
  package cfdr. 
code/reproducibility_check.R: checks R and package versions, and 
  for each simulation matrix (except sim_parametric_adjustment)
  chooses a random row and reproduces it.
code/run_sim_paradj.R: a shortened version of run_simulation.R
  which uses a parametric adjustment for parametric cFDR. Used 
  for reproducibility only.

*****************************************************************
*--  Data objects                                             --*
*****************************************************************

Folder 'data' contains 15 objects. Nine are simulation matrices, 
 two are .RData files generated by code/simulation_analysis.R, 
 and three (in subfolder TWAS) are objects relating to the TWAS 
 analysis. The final object is a README describing columns of 
 simulation matrices.
data/sim_gen_high_fdr.txt: matrix of simulation results for
  general circumstances, with parameters randomly chosen from the 
  distribution specified in manuscript, controlling FDR at 0.1.
data/sim_gen_high_fdr_null.txt: matrix of simulation results for
  general circumstances, with parameters chosen randomly from 
  distribution specified in manuscript conditional on 
  n1p + n1pq=0; that is, no true associations, controlling FDR 
  at 0.1
data/sim_gen_low_fdr.txt: matrix of simulation results for
  general circumstances, with parameters randomly chosen from the 
  distribution specified in manuscript, controlling FDR at 0.01.
data/sim_gen_low_fdr_null.txt: matrix of simulation results for
  general circumstances, with parameters chosen randomly from 
  distribution specified in manuscript conditional on 
  n1p + n1pq=0; that is, no true associations, controlling FDR 
  at 0.01
data/sim_fixed.txt: matrix of simulation results in which 
  parameters are chosen from one of several fixed parameter sets.
data/sim_cov.txt: matrix of simulation results with parameters 
  selected randomly as for sim_gen_high_fdr.txt, but with 
  dependent observations according to either a block diagonal
  or equicorrelated covariance matrix.
data/sim_cov_null.txt: matrix of simulation results with 
  parameters selected randomly as for sim_gen_high_fdr_null.txt;
  that is, with no true associations, and with dependent 
  observations according to either a block diagonal or 
  equicorrelated covariance matrix.
data/sim_unrelated.txt: matrix of simulation results with 
  parameters chosen as for sim_gen_high_fdr.txt, but 
  conditioning on n1pq=0; that is, no shared associations.
data/sim_parametric_adjustment.txt: matrix of simulation results 
  using parametrised version of cFDR only, and using an 
  'adjustment' based on the parametrisation rather than the 
  empirical CDF.
data/iterated_cfdr_data.RData: data used in assessing iterated 
  cfdr. This is deterministically generated by a block of code in 
  code/simulation_analysis.R, but because it takes several hours,
  it is saved and restored rather than regenerated.
data/convergence_data.RData: data used for drawing figure showing 
  convergence of various L-regions. This is deterministically 
  generated by a block of code in code/simulation_analysis.R, but
  as it takes several hours to generate it is saved and restored 
  rather than regenerated.
data/TWAS/raw/BCAC.dat: raw data for breast cancer GWAS, 
  downloaded from twashub.org.
data/TWAS/raw/OCAC.dat: raw data for ovarian cancer GWAS, 
  downloaded from twashub.org.
data/TWAS/twas_summary.RData: processed TWAS data, generated 
  deterministically by code/twas_analysis.R. Takes several hours 
  to generate, so saved and restored rather than regenerated.
data/README.txt: an explanation of the columns of each of the 
  simulation matrices
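
To see what a saved .RData file contains without overwriting 
objects in the current session, it can be loaded into a separate 
environment. This is a minimal sketch; the helper name 
list_rdata_objects is hypothetical, and no assumption is made 
about the object names stored in each file.

```r
# Hypothetical helper: load an .RData file into a fresh
# environment and return the names of the objects it holds.
list_rdata_objects <- function(path) {
  env <- new.env()
  load(path, envir = env)
  ls(env)
}

# Usage, from the folder containing this README:
# list_rdata_objects("data/convergence_data.RData")
```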


*****************************************************************
*--  R and package versions                                   --*
*****************************************************************

R version 3.3.3
mnormt version 1.5.5
mgcv version 1.8.17
pbivnorm version 0.6.0
MASS version 7.3.45
fields version 8.10
matrixStats version 0.51.0
latex2exp version 0.4.0
maps version 3.1.1
spam version 1.4.0
grid version 3.3.3
nlme version 3.1.131.1
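
The list above can be compared against the local installation 
with a sketch like the following. This is not the check 
performed by code/reproducibility_check.R; the helper 
versions_match is hypothetical. Note that R treats versions 
written with dots (as above) and with hyphens (as in the 
sessionInfo() output below) as equal, so 1.5.5 and 1.5-5 match.

```r
# Versions as listed in this README (R itself is checked below).
expected <- c(mnormt = "1.5.5", mgcv = "1.8.17",
              pbivnorm = "0.6.0", MASS = "7.3.45",
              fields = "8.10", matrixStats = "0.51.0",
              latex2exp = "0.4.0", maps = "3.1.1",
              spam = "1.4.0", grid = "3.3.3",
              nlme = "3.1.131.1")

# Hypothetical helper: TRUE or FALSE per package, NA if the
# package is not installed.
versions_match <- function(expected) {
  vapply(names(expected), function(pkg) {
    installed <- tryCatch(packageVersion(pkg),
                          error = function(e) NULL)
    if (is.null(installed)) return(NA)
    installed == package_version(expected[[pkg]])
  }, logical(1))
}

versions_match(expected)   # named logical vector
getRversion() == "3.3.3"   # TRUE only under R 3.3.3
```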

Output of sessionInfo() on reproduction:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux 7.8 (Nitrogen)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] latex2exp_0.4.0    fields_8.10        maps_3.1.1         spam_1.4-0         MASS_7.3-45        pbivnorm_0.6.0     mgcv_1.8-17        nlme_3.1-131.1     mnormt_1.5-5       matrixStats_0.51.0

loaded via a namespace (and not attached):
[1] magrittr_1.5    Matrix_1.2-8    tools_3.3.3     stringi_1.1.6   stringr_1.2.0   lattice_0.20-34
>