iCPAGdb

Introduction

This repo contains both the backend and frontend code for iCPAGdb and its web browser.

iCPAGdb integrates the results of GWAS across different phenotypic scales, identifying and quantifying the significance of pleiotropic loci that impact molecular, cellular, and organismal traits. The goal is to provide a resource that allows experts on a particular human trait to easily develop hypotheses for molecular and cellular phenotypes that underlie the physiology of that trait. Molecules and cellular pathways implicated in this way could serve as novel biomarkers or targets for therapeutic approaches. The current version of iCPAGdb contains GWAS summary statistics from >4400 diseases/traits. It allows users to explore pre-computed correlations across all existing diseases and/or upload their own GWAS to identify and explore SNPs shared between their own GWAS and the >4400 diseases/traits.

This repo contains two parts:

Python (3.6+) code for iCPAGdb
R Shiny code for the web browser

Update for V1.1

We added the --lddb-r2 parameter so that users can choose a different LD proxy database. However, because each study in the pre-built GWAS dataset was clumped with PLINK using --clump-r2 0.4, we recommend keeping the default: --lddb-r2 0.4.
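As a rough illustration of how a population and an r2 threshold relate to the database files, the sketch below builds the expected file name. The helper ld_database_path is hypothetical and not part of main.py; the naming pattern, populations, and r2 values are inferred from the "db" folder listing later in this README.

```python
from pathlib import Path

# Populations and r2 thresholds that appear in the "db" folder listing of this README.
POPULATIONS = ("AFR", "EAS", "EUR")
R2_VALUES = ("0.2", "0.4", "0.8")


def ld_database_path(pop="EUR", r2="0.4", db_dir="db"):
    """Return the expected path of the LD-proxy database for a population and r2.

    Hypothetical helper: the naming pattern is inferred from the file listing,
    not taken from main.py.
    """
    r2 = str(r2)
    if pop not in POPULATIONS:
        raise ValueError(f"unknown population {pop!r}; expected one of {POPULATIONS}")
    if r2 not in R2_VALUES:
        raise ValueError(f"unsupported r2 {r2!r}; expected one of {R2_VALUES}")
    return Path(db_dir) / f"cpag_gwasumstat_v1.1.{pop}_ld{r2}.db"
```

Calling ld_database_path("EUR", 0.4) yields the file that matches the recommended default, which makes it easy to confirm the file is present before starting a run.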

Quick start

Configure and download the database and third-party software

Download PLINK 1.9 directly, or fetch it with wget on Linux/Mac, and place it in the folder "plink_bins".

## Choose the PLINK build that matches your platform; this example uses the Linux build

wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20201019.zip

Download the zipped database file (~33 GB) from Dropbox LINK and decompress it into the "db" folder. Here is an example of downloading the required database using wget on Linux/Mac OS.

wget https://www.dropbox.com/sh/na23jflxcgk0nib/AAAKi--r8cS44U8VboFWBTP2a/cpag_gwasumstat_v1.1.EUR_ld0.4.db  --content-disposition
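After a download of this size it can be worth confirming that the file opens before starting an analysis. The sketch below assumes the .db files are SQLite databases (suggested by the sqlite dependency, but not stated in this README) and lists their table names without assuming any schema. The helper list_tables is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in an SQLite file, sorted alphabetically."""
    # Open read-only through a file: URI so that a mistyped path raises an
    # error instead of silently creating a new, empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example call (path assumed from the folder layout in this README):
# print(list_tables("db/cpag_gwasumstat_v1.1.EUR_ld0.4.db"))
```

An sqlite3.DatabaseError on the query usually indicates a truncated or corrupted download.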

The final folder structure, containing all required code and data files:
pyCPAGdb
├── _utils.py
├── anno_parent_efo.py
├── main.py
├── stats.py
├── plink_bins
│   ├── plink
│   └── prettify
├── db
│   ├── cpag_gwasumstat_v1.1.AFR_ld0.2.db
│   ├── cpag_gwasumstat_v1.1.AFR_ld0.4.db
│   ├── cpag_gwasumstat_v1.1.AFR_ld0.8.db
│   ├── cpag_gwasumstat_v1.1.EAS_ld0.2.db
│   ├── cpag_gwasumstat_v1.1.EAS_ld0.4.db
│   ├── cpag_gwasumstat_v1.1.EAS_ld0.8.db
│   ├── cpag_gwasumstat_v1.1.EUR_ld0.2.db
│   ├── cpag_gwasumstat_v1.1.EUR_ld0.4.db
│   ├── cpag_gwasumstat_v1.1.EUR_ld0.8.db
│   ├── cpag_gwasumstat_v1.2.db
│   ├── gwas-efo-trait-mappings.txt
│   └── lddat
│       ├── AFR_1kg_20130502_maf01.bed
│       ├── AFR_1kg_20130502_maf01.bim
│       ├── AFR_1kg_20130502_maf01.fam
│       ├── EAS_1kg_20130502_maf01.bed
│       ├── EAS_1kg_20130502_maf01.bim
│       ├── EAS_1kg_20130502_maf01.fam
│       ├── EUR_1kg_20130502_maf01.bed
│       ├── EUR_1kg_20130502_maf01.bim
│       └── EUR_1kg_20130502_maf01.fam
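The layout above can be checked with a short script before the first run. This is a minimal sketch: the helper missing_paths is hypothetical, and the list of files treated as required for a EUR run at r2 0.4 is an assumption based on the listing, not something main.py is known to enforce.

```python
from pathlib import Path

# Relative paths taken from the folder listing above. Treating exactly these
# as required for a EUR run is an assumption; extend the list as needed.
REQUIRED_PATHS = [
    "main.py",
    "plink_bins/plink",
    "db/cpag_gwasumstat_v1.1.EUR_ld0.4.db",
    "db/gwas-efo-trait-mappings.txt",
    "db/lddat/EUR_1kg_20130502_maf01.bed",
    "db/lddat/EUR_1kg_20130502_maf01.bim",
    "db/lddat/EUR_1kg_20130502_maf01.fam",
]


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the required relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]


# Example call from inside the pyCPAGdb folder:
# print(missing_paths("."))
```

An empty list means every checked file is in place; otherwise the returned paths show what still needs to be downloaded or moved.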

Configure the computing environment for Python 3

The fastest way is to install Miniconda and install the required packages from there.

Create a new environment using conda:

conda create -n icpagdb python=3.7
conda activate icpagdb

Install the Python packages using conda:

conda install -c conda-forge pandas
conda install -c conda-forge scipy
conda install -c conda-forge joblib
conda install -c conda-forge tqdm
conda install -c conda-forge sqlite
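To confirm that the environment is complete, a quick check like the one below can be run inside the activated conda environment. It is a sketch with a hypothetical helper, missing_modules. The module list mirrors the packages above and assumes that the conda package "sqlite" is used from Python through the standard-library module sqlite3.

```python
import importlib.util

# Import names for the packages installed above. The conda package "sqlite"
# is assumed to be used through the standard-library module "sqlite3".
REQUIRED_MODULES = ("pandas", "scipy", "joblib", "tqdm", "sqlite3")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Example call:
# print(missing_modules())
```

An empty list means all modules can be located; any name returned should be installed with conda before running main.py.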

Run example (Shell command)

example 1

Serum metabolites/xenobiotics (Shin et al. 2014) vs. Human disease

python main.py cpagdb --threads 2 --subtype NHGRI --NHGRI-Pcut 5e-8 \
  --subtype BloodMetabolites,BloodXenobiotic --Pcut 1e-5 \
  --lddb-pop EUR --outfile NHGRI-p1e-05-BloodMetabolitesXenobiotic-p1e-05-EUR.csv

Then annotate the phenotypes:

python main.py post_analysis --anno-ontology --anno-cols Trait1 \
  --infile output/NHGRI-p1e-05-BloodMetabolitesXenobiotic-p1e-05-EUR.csv \
  --outfile NHGRI-p1e-05-BloodMetabolitesXenobiotic-p1e-05-EUR.csv

example 2

python main.py cpagdb --threads 2 --subtype H2P2 --H2P2-Pcut 1e-7 \
  --lddb-pop EUR --outfile output/H2P2-p1e-07-EUR.csv

example 3 (user GWAS)

Download the COVID-19 GWAS example from the "Upload and compute CPAG" page at HERE

python main.py usr-gwas --threads 10 --infile iCPAGdb-Sample-GWAS-top_EllinghausPCs_covid19.csv \
  --SNPcol "avsnp150" --delimitor "," --Pcol "p_value" \
  --usr-pcut 1e-5 \
  --outfile top_EllinghausPCs_covid19_pcut1e-5_icpagdb_out.csv
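Before submitting a user GWAS, it may help to confirm that the file has the expected columns and to see how many rows fall under the chosen cutoff. The sketch below uses the column names and cutoff from the command above. The helper count_significant_rows is hypothetical, and treating the cutoff as "p-value at or below the threshold" is an assumption about how --usr-pcut is applied.

```python
import csv


def count_significant_rows(path, snp_col="avsnp150", p_col="p_value",
                           delimiter=",", p_cut=1e-5):
    """Count rows whose p-value is at or below p_cut after checking the header.

    Raises ValueError if either column is missing from the header.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        header = reader.fieldnames or []
        missing = [col for col in (snp_col, p_col) if col not in header]
        if missing:
            raise ValueError(f"missing column(s) {missing}; header is {header}")
        kept = 0
        for row in reader:
            try:
                p_value = float(row[p_col])
            except (TypeError, ValueError):
                continue  # skip rows with empty or non-numeric p-values
            if p_value <= p_cut:
                kept += 1
    return kept


# Example call with the sample file named above:
# print(count_significant_rows("iCPAGdb-Sample-GWAS-top_EllinghausPCs_covid19.csv"))
```

A count of zero suggests either a cutoff that is too strict for the study or a mismatch between the --Pcol value and the file header.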