Welcome to the Github accompanying the pre-print Multi-layered genetic approaches to identify approved drug targets which compares gene prioritization methods in their ability to identify historical drug targets.
If you use scripts from this Github please consider citing the pre-print:
Sadler MC, Auwerx C, Deelen P, Kutalik Z. Multi-layered genetic approaches to identify approved drug targets. medRxiv. 2023:2023-03.
This Github contains the workflow pipeline that was used to define drug targets, calculate method-specific gene scores, network diffusion scores and drug target enrichment scores (odds ratios and AUC scores). Code to calculate other statistics presented in the study such as agreement between different gene scoring methods, drug database and network statistics as well drug target heritability analyses is also part of the pipeline.
The workflow is organized as a snakemake pipeline. This allows for a reproducible analysis and parallel computations - this is especially useful since the analysis has been conducted on 30 different traits/drug indications and 3 different networks. However, individual scripts can also be run without the snakemake workflow manager by replacing snakemake variables by hard-coded input and output paths.
In this study, input data comes from various public databases which can be large datasets and sometimes needs the signing of terms and conditions (i.e. to download the deCODE data). Here, we describe the different steps that need to be taken in order to assemble all required input data. When using this data, please consider citing the sources appropriately.
To define drug targets two different sources of input data is needed:
- Indication - drug links
- Drug - gene target links
Indication - drug links have been collected from Ruiz et al., Nature Communications, 2021, DrugBank and ChEMBL. For Ruiz et al., Supplementary Table 6 needs to be downloaded, for DrugBank, the file (scripts/drug_targets/GWAS_x_drug_drugBank.py
) should be run and for ChEMBL the indication - drug link file can be downloaded here (click on "Select All" and then on the right "TSV" button to download the full file).
Drug - gene target links have been collected from DGIdb, STITCH and ChEMBL. On the DGIdb website, drug-target links can be downloaded from the Downloads page (interactions, genes and drugs files), STITCH data can be downloaded from the Download page (select "Homo sapiens" organism and then download the 9606.actions.v5.0.tsv.gz
file as well as the chemical.sources.v5.0.tsv.gz
file which contains the mapping of chemical IDs to DrugBank - since this file is very large, this pipeline assumes that the file is already subsetted to entries relevant to DrugBank and named stitch_chemical_drugbank_vocabulary.tsv
)
- GWAS scores
GWAS scores should be computed using the PascalX software. The documentation on installing the software is very detailed. With default settings, gene scores are calculated based on gene names and not EnsemblIds. The rule run_pascal_ensembl
in the Snakefile calculates gene scores based on EnsemblIds. Downloading and formatting the reference is explained in the documentation.
In this pipeline, GWAS summary statistics are assumed to be in the GCTA-COJO format (.ma), however, any other format works as well with the PascalX software as long as the field number containing p-values is specified.
- QTL-GWAS scores
QTL-GWAS scores have been calculated using the smr-ivw software. The Github wiki on univariable MR explains usage and theory behind MR gene causal effect calculations, as well as the formatting of input QTL data. For this software, GWAS summary statistics have to be in GCTA-COJO format (.ma). The rules transcript_eQTLGen_MR
and Decode_MR
in the Snakefile calculate transcript and protein QTL-GWAS scores, respectively. A reference panel in PLINK 1.9 Format (.bed, .bim, .fam files) is required.
- Exome scores
In this study, pre-calculated Exome scores from Backman et al., Nature, 2021 were used. All scores can be calculated from the GWAS catalog and the full list of mapping between GWAS Catalog accession IDs and UK Biobank traits is available in the Supplementary Data 4 of the Backman et al., Nature, 2021 paper. Files downloaded for this study are specified in data/Exome_data.csv
.
Three networks have been used in this study:
-
STRING: The STRING protein-protein interaction network v11 can be downloaded here: select "Homo sapiens" organism and then download the
9606.protein.links.v11.5.txt.gz
file. -
CoXRNAseq: The co-expression network calculated from RNA-seq samples can be downloaded here (documentation can be found here). Only the network
gene_coregulation_165_eigenvectors_protein_coding.txt
from the bundle is needed. -
FAVA: This network which is based on single cell RNA-seq read-count data from the Human Protein Atlas and proteomics data from the PRoteomics IDEntifications (PRIDE) database can be downloaded here.
Cis-heritability estimates for transcripts and proteins are available in the Supplementary tables of this study. Alternatively, they can be calculated with the LDAK software using the following command:
ldak5.2.linux --reml ldak.reml --summary sumstat --bfile refpanel --power -.25 --ignore-weights YES --region-number 1 --region-prefix region_prefix
where ldak.reml is the prefix of your output file, sumstat the summary statistics in LDAK format, refpanel the reference panel in Plink format, region_prefix the prefix of the file containing your SNPs from the region. Further details are in the documentation of the LDAK REML method.
MIT License