Adaptive Sum of Powered Tests (aSPU) in Julia
Installation
The package may be installed from github:
(v1.0) pkg> add https://github.com/kaskarn/JaSPU
or
julia> using Pkg
julia> Pkg.add("https://github.com/kaskarn/JaSPU")
The Adaptive Sum of Powered Tests
The Adaptive Sum of Powered Tests (aSPU) is used in genome-wide association settings to evaluate the effect of SNPs across k traits. To do so, it uses k z-scores from previous regression analyses, and aggregates them as multiple sums of powered scores:
Where S may either be untransformed z-scores (the default), or z-scores transformed with the inverse of R:
, and gamma takes on integer values, by default:
with
equivalently computed as
aSPU adaptively selects the SPU with the greatest power, and performs Monte-Carlo simulations to produce p-values, using the empirical multivariate-normal distribution of null z-scores.
Package description
The package provides the function aspu
to compute SPU and aSPU p-values for each SNP contained in an input file. This implementation
fully uses available processors (which must be previously added with addprocs()
, or ClusterManagers
functions), has a minimal
memory footprint, and is orders of magnitude faster than R implementations.
Input file
The input file must include one SNP per line, with the first column containing SNP names, and, by default, each subsequent column
containing a z-score value; this default behavior can be changed using the option skip = n, so that z-scores start at the nth
column, instead of the second one. Lines with missing values ("NA" by default, or specified with the option na = string
) are
permitted, and will appear in the output with missing entries.
Output file
The output file uses the same delimiter as the input file. It replicates the input files, and adds columns for the aspu p-value, the p-values for each SPU score (each gamma), and the best-powered gamma value (in the case of ties, the higher gamma value is returned).
By default, the output file is named following the pattern "aspu_results_1eN_filein", where N is the number of iterations,
and filein is the input filename. By default, the output file is placed in a timestamped folder created in the working directory.
These behaviors can be overriden by specifying out = path
, where path
can be a directory (to override the location of results), or
a file name.
Usage
The aspu
function has two required arguments: a path to the input file, and the number of iterations used to calculate p-values.
Since Monte-Carlo simulations are used to compute p-values, the minimum achievable aSPU p-value using N iterations will be
The number of iterations should be provided as a power of 10, any other number will be rounded up to the next power of 10 (e.g. 200,000 -> 1,000,000)
aspu(
filein::AbstractString, maxiter::Int64; #required options
pows::Vector{Int64} = collect(0:8), invR_trans::Bool = false, #key aSPU parameters
covfile::AbstractString = "", plim::Float64 = 1e-4, #R estimation options
delim::Char = '\t', noheader::Bool = false, skip::Int64 = 1, #input file options
out::AbstractString = "", verbose::Bool = true, nosavecov::Bool = false, #output options
outtest::Real = Inf #testing/development
)
A simple, common example would be to compute aSPU with default gammas, allowing p-values to well exceed the genome-wide significance threshold of 5e-8:
aspu("myzscores.txt", 10^9) #use default output
aspu("myzscores.txt", 10^9, out = "aspu_results/myresults.txt") #save output to specific destination
aspu("myzscores.txt", 10^9, plim = 1e-5) #use a lower threshold for null SNPs when computing Z correlation
Users may choose a more succinct set of gamma values, since added gains from large gammas are uncertain. A reasonable alternative set may be 1, 2, 3, and infinity. We use 0 to represent infinity, and the aspu
call can be made as:
aspu("myzscores.txt", 10^9, pows = [0, 1, 2, 3])
Performance
Below are rough estimates for the CPU-hours spent on Monte-Carlo simulations, estimated on the UNC-Chapel hill high-performance computing cluster (longleaf):
Iterations | minimum p-value | CPU-hours |
---|---|---|
10^9 | 1E-9 | 2.8 |
10^10 | 1E-10 | 24 |
10^11 | 1E-11 | 264 |
These do not count the time taken to read and process SNPs, which depends on the number of SNPs. In our case, it took 5.5 hours to run aspu
on 18M SNPs, using 10^11 iterations, and 60 CPUs.
Memory usage
The aspu
function does not create particularly large objects