Workflows to generate cis-regulatory datasets used for variant-to-gene (V2G) assignment in Open Targets Genetics.
Requires:
- Conda
- (Linux) split v>=8.23
- (Mac) gsplit v>=8.23
If necessary, install conda.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p $HOME/miniconda
echo export PATH="$HOME/miniconda/bin:\$PATH" >> ~/.profile
. ~/.profile
This only needs to be run if these datasets have been updated.
# Install dependencies into isolated environment
conda env create -n v2g_data --file environment.yaml
# Activate environment
conda activate v2g_data
# Alter configuration file
nano configs/config.yaml
# Authenticate google cloud storage
gcloud auth application-default login
# Increase spark memory if running locally
export PYSPARK_SUBMIT_ARGS="--driver-memory 8g pyspark-shell"
# Execute workflows (locally)
ncores=3
snakemake -s andersson2014_fantom5.Snakefile --cores $ncores
snakemake -s thurman2012_dhscor.Snakefile --cores $ncores
# snakemake -s javierre2016_pchic.Snakefile --cores $ncores
snakemake -s jung2019_pchic.Snakefile --cores $ncores
# Copy output to GCS
gsutil -m rsync -r -x ".*DS_Store$" output gs://genetics-portal-staging/v2g
This needs to be run if QTL datasets have been added or updated.
# Update list of QTL studies in process_QTL_datasets_from_sumstats.w_hack.py
# Start dataproc server
gcloud beta dataproc clusters create \
em-qtlprocess \
--region europe-west1 \
--image-version=preview \
--properties=spark:spark.debug.maxToStringFields=100,spark:spark.executor.cores=32,spark:spark.executor.instances=1 \
--master-machine-type=n2-standard-32 \
--master-boot-disk-size=2TB \
--num-master-local-ssds=0 \
--zone=europe-west1-d \
--initialization-action-timeout=20m \
--single-node \
--max-idle=10m
# Submit QTL processing job
gcloud dataproc jobs submit pyspark \
--cluster=em-qtlprocess \
--region europe-west1 \
process_QTL_datasets_from_sumstats.w_hack.py
# To monitor
gcloud compute ssh em-qtlprocess-m \
--project=open-targets-genetics-dev \
--zone=europe-west1-d -- -D 1080 -N
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
--proxy-server="socks5://localhost:1080" \
--user-data-dir="/tmp/em-qtlprocess-m" http://em-qtlprocess-m:8088
Follow steps in mapping/README.md, if new datasets have been added or updated.
QTL datatypes contain association statistics between individual genetic variation and a molecular phenotype (e.g. expression QTLs, and protein QTLs).
Output file naming convention:
{data_type}/{exp_type}/{source}/{cell_type}/{chrom}.pval{pval}.{cis_reg|trans_reg}.tsv.gz
Output files:
{chrom}.pval{pval}.cis_reg.tsv.gz
: cis-regulatory associations for{ensembl_id}
within 1Mb (config['sun2018_cis_window']
) of gene TSS. Filtered by pval < 2.5e-5 (config['sun2018_cis_pval']
).{chrom}.pval{pval}.trans_reg.tsv.gz
: trans-regulatory associations for{ensembl_id}
outside 1Mb (config['sun2018_cis_window']
) of gene TSS. Filtered by pval < 5e-8 (config['sun2018_trans_pval']
).
QTL output columns:
chrom
,pos
: genomic coordinates on GRCh37other_allele
: non-effect allele. Ref allele in GRCh37.effect_allele
: allele on which the effect (beta
) is measured. Effect should always be ALT allele in GRCh37!ensembl_id
: Ensembl gene IDbeta
: the effect size and directionse
: standard error ofbeta
pval
: p-value of association
Most data now comes from eQTL catalogue, including GTEx eQTLs, which have been reprocessed by eQTL catalogue.
- Publication link
- Dataset: summary statistics ingested from eQTL catalogue and other eQTL datasets
- Score = -log10(pval)
- Filters:
- Variants are filtered to keep those with p-value ≤ (0.05 / number of variants tested for the gene)
- Pre-filtered at source to remove associations >1Mb from gene TSS
This is still to be done. sQTL data will be integrated initially from GTEx v8, but may be switched to uniform data from the eQTL catalogue in the future, once this is ready.
- Publication link
- Score = -log10(pval)
- Filters:
- Same as for eQTLs
Evidence linking genetic variation to protein abundance in Sun et al. (2018) pQTL data.
- Publication link
- Dataset link
- Score = -log10(pval)
- Filters:
- Variants are filtered to keep those with p-value ≤ (0.05 / number of variants tested for the gene)
- Filtered to remove associations >1Mb from each gene TSS.
- Notes:
- 3,195 total SOMA IDs goes to 2,872 unique Ensembl gene IDs (323 lost). For these we are deduplicating by gene_id and keeping the first occurence in the manifest.
- We do not exclude "pQTLs without evidence against binding effects".
"Of 1,927 pQTLs, 549 (28%) were cis-acting (Supplementary Table 4). Genetic variants that change protein structure may result in apparent cis pQTLs owing to altered aptamer binding rather than true quantitative differences in protein levels. We found evidence against such artefactual associations for 371 (68%) cis pQTLs (Methods, Supplementary Tables 4, 7, 8). The results were materially unchanged when we repeated downstream analyses but excluded pQTLs without evidence against binding effects." Sun et al.
Interval datatypes contain genomic regions that are linked to gene transcription start site through molecular interactions (e.g. promoter Capture Hi-C).
Output file naming convention:
{data type}/{experiment type}/{source}/{cell|tissue type}/{chrom}.{processed|raw}.tsv.gz
Interval output columns:
chrom
,start
,end
: 1-based genomic co-ordinates. Start and end are inclusive.ensembl_id
: Ensembl gene IDscore
: feature score from original dataset (unmodified)cell_type
: cell or tissue type of this experiment
Evidence linking genetic variation to genes using Promoter Capture Hi-C in 27 human cell/tissue types.
- Publication link
- Dataset link: Table S3
- Score = 1 or 0
- Filters:
- Only significant P–O pcHi-C interactions included at source
Evidence linking genetic variation to genes using Promoter Capture Hi-C in each of the 17 human primary hematopoietic cell types.
- Publication link
- Dataset link
- Score = CHiCAGO score (Cairns et al., 2016)
- Filters:
- Pre-filtered at source to contain only CHiCAGO >= 5
- Filtered interactions across chromosomes and over 2.45e6 bases away on either side (see plots)
Evidence linking genetic variation to genes using correlation between the transcriptional activity of enhancers and transcription start sites using the FANTOM5 CAGE expression atlas.
- Publication link
- Dataset link
- Score = R-squared
- Filters:
- Pre-filtered at source to contain only false-discovery rate (FDR) > 0.05
- Pre-filtered at source to remove interactions >2Mb
- Filtered interactions between chromosomes and over 1,068,265 bases (2 stdev) on either side (see plots)
Evidence linking genetic variation to genes using correlation of DNase I hypersensitive site and gene promoters across 125 cell and tissue types from ENCODE.
- Publication link
- Dataset link
- Score = R-squared
- Filters:
- Pre-filtered at source to contain R-squared > 0.7 only
- Pre-filtered at source to remove interactions over 500 kb
- Pre-filtered at source to between chromsome interactions
Commands for copying from staging. NOTE: This is normally taken care of by the backend team, and the code below hasn't been updated.
version_date=`date +%y%m%d`
# Interval datasets
gsutil -m rsync -r gs://genetics-portal-staging/v2g/interval/pchic/javierre2016/$version_date/ gs://genetics-portal-data/v2g/$version_date/interval/pchic/javierre2016/
gsutil -m rsync -r gs://genetics-portal-staging/v2g/interval/fantom5/andersson2014/$version_date/ gs://genetics-portal-data/v2g/$version_date/interval/fantom5/andersson2014/
gsutil -m rsync -r gs://genetics-portal-staging/v2g/interval/dhscor/thurman2012/$version_date/ gs://genetics-portal-data/v2g/$version_date/interval/dhscor/thurman2012/
# QTL datasets
gsutil -m rsync -r gs://genetics-portal-staging/v2g/qtl/eqtl/gtex_v7/$version_date/ gs://genetics-portal-data/v2g/$version_date/qtl/eqtl/gtex_v7/
gsutil -m rsync -r gs://genetics-portal-staging/v2g/qtl/pqtl/sun2018/$version_date/ gs://genetics-portal-data/v2g/$version_date/qtl/pqtl/sun2018/