PRStuning: Estimate Testing AUC for Binary Phenotype Using GWAS Summary Statistics from the Training Data
-
If you use this software, please cite https://www.nature.com/articles/s41467-023-44009-0
-
Software dependency: Python version: >=3.8; R; PLINK; GSL (optional for speedup)
Python package dependency: numpy; pandas; scipy; scikit-learn; rpy2; cython; tables; pysnptools; CythonGSL; bgen-reader
wget https://github.com/lscientific/PRStuning/archive/refs/heads/master.zip -O PRStuning.zip
unzip PRStuning.zip
rm PRStuning.zip
cd ./PRStuning-master
For Linux/Mac, see https://solarianprogrammer.com/2020/01/26/getting-started-gsl-gnu-scientific-library-windows-macos-linux/#gsl_usage_example for detailed installation
For Windows, you can download the compiled library directly from https://code.google.com/p/oscats/downloads/list
Copy bin/libgsl-0.dll and bin/libgslcblas-0.dll into the working directory
Special tips for installing GSL on Windows can be found at http://joonro.github.io/blog/posts/installing-gsl-and-cythongsl-in-windows/
pip install scipy numpy pandas scikit-learn rpy2 cython tables pysnptools CythonGSL bgen-reader
python setup.py build_ext --inplace
Installation is expected to be finished within a few minutes
-
(Optional) Use
plinkLD.py
to calculate the LD matrix for PLINK or Oxford binary format genotype data of a reference panel. Seepython plinkLD.py --help
for further usage description and options -
Use
PRStuning.py
to obtain estimated AUC using GWAS summary statistics from the training data. Seepython PRStuning.py --help
for further usage description and options
python plinkLD.py --bfile BFILE [--bed BED] [--bim BIM] [--fam FAM] [--block: BLOCK_FILE] [--snplist SNPLIST] \
[--output OUTPUT] [--method METHOD] [--thread THREAD] [--compress COMPRESS] [--log LOG]
--bfile BFILE
Binary data file
--bed: BIM
Binary data file (Genotypes)
--bim: BIM
Binary data file (SNP info)
--fam: FAM
Binary data file (Individual info)
--bgen: BGEN
Binary data file with Oxford format for getting dosages
--block: BLOCK_FILE
Block file (Default: all SNPs are in one block)
--snplist: SNPLIST
SNP list file (Default: all SNP pairs are calculated)
--output: OUTPUT
Output filename (Default: LD.h5)
--method: METHOD
Correlation estimation method, including Pearson, LW (Default: Pearson)
--thread: THREAD
Thread number for calculation (Default: Total CPU number)
--compress: COMPRESS
Compression level for output (Default: 9)
--log: LOG
log file (Default: plinkLD.log)
python PRStuning.py --ssf SSF --weight WEIGHT --n0 N0 --n1 N1 [--pruning] [--ref REF] [--homo HOMO] [--geno GENO] \
[--pheno PHENO] [--aligned] [--n N] [--dir DIR] [--thread THREAD]
--ssf SSF
GWAS Summary statistic file. Should be a text file with columns SNP/CHR/BP/BETA/SE
--weight WEIGHT
PRS model weight file. Should be a tabular text file with the five columns being SNP/CHR/BP/A1/A2/ and the following columns correspond to weights with different PRS model parameters
--n0 N0
Control sample size of the GWAS summary statistics
--n1 N1
Case sample size of the GWAS summary statistics
--pruning
If not pruning, a Gibbs-sampling-based State-Augmentation for Marginal Estimation (SAME) is used. If pruning, EM algorithm is used on SNPs with non-zero PRS weights. Do not provide a reference LD file if pruning.
--ref REF
Reference LD file. Provide if not pruning. Should be a (full path) hdf5 file storing the LD matrix and corresponding SNP information (plinkLD.py can be used to convert PLINK binary files into the LD hdf5 file)
--homo HOMO
If the summary statistics are from a single homogeneous GWAS cohort, use --homo option to not further shrink LD (Default: False)
--geno GENO
Individual-level genotype data for testing purposes (if provided). Should be PLINK binary format files with extension .bed/.bim/.fam
--pheno PHENO
External phenotype data in a tabluar text file. Should be provided if testing genotype data are provided. If the header is not provided, the first, second columns and third columns should be FID, IID, and PHE. Otherwise, there are three columns named 'FID', 'IID', and 'PHE'
--dir DIR
Output directory (Default: ./output)
--thread THREAD
Number of parallel threads, by default all CPUs will be utilized (Default: all CPUs are utilized)
The reference panel based on the 1000 Genomes Project is available here
- 1000G.EUR.QC.bim&fam&bed are the 1000G plink files
- fourier_ls-all.bed is the SNP block file
Generate the reference LD matrix file using plinkLD.py
python plinkLD.py --bfile 1000G.EUR.QC --block fourier_ls-all.bed --out LD.h5
Note that binary data file with Oxford format (bgen file) can be an alternative to the PLINK file
For European data, the LD matrix file is already available here
- ld_1kg_EUR_hm3_shrink.h5 is the LD matrix file
./simdata/: Simulated data with correlated SNPs
- GWAS training dataset ssf.txt with sample size N0=5000, N1=5000
- PRS weight file weight_ldpred.txt is generated using LDpred. The parameters are the fractions of causal markers
- PRS weight file weight_pt.txt only contains independent SNPs after pruning. The parameters are p-value thresholds
- Testing genotype data ./simdata/geno(.bim/bed/fam) have 500 controls and 500 cases
- Reference data are generated from 2000 controls
- All datasets share the same LD structure AR(1) with rho=0.2
When --pruning
is not used, the SNPs are treated as dependent. Thus use Gibbs sampling-based SAME algorithm to estimate.
- Step 1
UseplinkLD.py
to generate the reference LD matrix file
python plinkLD.py --bfile ./simdata/ref --output ./simdata/ref.h5
- Step 2
Use--homo
here since using simulated data. For real GWAS summary statistics that are from a single homogeneous GWAS cohort, do not use this option.
python PRStuning.py --ssf ./simdata/ssf.txt --weight ./simdata/weight_ldpred.txt --ref ./simdata/ref.h5 --pheno ./simdata/pheno.txt --geno ./simdata/geno --n0 5000 --n1 5000 --dir ./simdata/output/ --homo
- Results
./simdata/output/auc_results.txt
includes the PRStuning and testing AUC results. Each row corresponds to a PRS parameter and the two columns correspond to PRStuning AUC and testing AUC respectively
Files with prefix./simdata/output/align_
are aligned datasets
./simdata/output/prs_results/
contains PRS scoring results using PLINK
./simdata/output/param.txt
contains the estimated parameters using SAME algorithm
./simdata/output/beta_est.txt
contains the estimated ground truth effect sizes
./simdata/output/log.txt
is the log file for PRStuning.py
./simdata/ref.log
is the log file for plinkLD.py
When --pruning
is used, the SNPs are treated as independent. Thus use EM algorithm to estimate. No reference data should be provided
python PRStuning.py --ssf ./simdata/ssf.txt --weight ./simdata/weight_pt.txt --pheno ./simdata/pheno.txt --geno ./simdata/geno --n1 5000 --n0 5000 --dir ./simdata/output_pt/ --pruning
- Results
./simdata/output/auc_results.txt
includes the PRStuning and testing AUC results. Each row corresponds to a PRS parameter and the two columns correspond to PRStuning AUC and testing AUC respectively
Files with prefix./simdata/output_pt/align_
are aligned datasets
./simdata/output_pt/prs_results/
contains PRS scoring results using PLINK
./simdata/output_pt/param.txt
contains the estimated parameters using EM algorithm
./simdata/output_pt/log.txt
is the log file for PRStuning.py