) ) ) ) ))) 🦇 echolocatoR 🦇 ((( ( ( ( (

Automated statistical and functional fine-mapping with extensive access to genome-wide datasets.

Fine-mapping methods are a powerful means of identifying causal variants underlying a given phenotype, but are underutilized due to the technical challenges of implementation. echolocatoR is an R package that automates end-to-end genomics fine-mapping, annotation, and plotting in order to identify the most probable causal variants associated with a given phenotype.

It requires minimal input from users (a GWAS or QTL summary statistics file), and includes a suite of statistical and functional fine-mapping tools. It also includes extensive access to datasets (linkage disequilibrium panels, epigenomic and genome-wide annotations, QTL).

The elimination of data gathering and preprocessing steps enables rapid fine-mapping of many loci in any phenotype, complete with locus-specific publication-ready figure generation. All results are merged into a single per-SNP summary file for additional downstream analysis and results sharing. Therefore echolocatoR drastically reduces the barriers to identifying causal variants by making the entire fine-mapping pipeline rapid, robust and scalable.

Documentation


Workflow

echoFlow


Quick installation

In R:

if(!"devtools" %in% installed.packages()){install.packages("devtools")}
devtools::install_github("RajLabMSSM/echolocatoR")

NOTE: While this GitHub repo is still private, you need to use a token to install echolocatoR using the auth_token argument (see here for details).
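
The token-based install mentioned in the note can be sketched roughly as follows. This is a minimal sketch, not an official install script: the helper name `install_private_echolocatoR` is hypothetical, and it assumes your personal access token is stored in an environment variable called `GITHUB_PAT`. Only the `auth_token` argument comes from the note above.

```r
# Hypothetical helper: install echolocatoR from the private repo with a token.
# Assumption: the GitHub personal access token lives in the GITHUB_PAT
# environment variable (set it in ~/.Renviron or with Sys.setenv()).
install_private_echolocatoR <- function(token = Sys.getenv("GITHUB_PAT")) {
  # Fail early with a readable message instead of an opaque HTTP error.
  if (!nzchar(token)) {
    stop("No GitHub token found; set GITHUB_PAT or pass `token` explicitly.")
  }
  # Install devtools only if it is not already available.
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("RajLabMSSM/echolocatoR", auth_token = token)
}
```

Calling `install_private_echolocatoR()` with no arguments picks up the token from the environment, which keeps it out of your scripts and shell history.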

Robust installation (conda)

As with most software, installation is half the battle. The easiest way to install all of echolocatoR's dependencies (which include R, Python, and command line tools) and make sure they play well together is to create a conda environment.

  1. If you haven't done so already, install conda.

  2. Download the echoR.yml file found here (this file tells conda what to install).

  3. In command line, create the env from the .yml file:

conda env create -f <path_to_file>/echoR.yml

  4. Activate the new env:

conda activate echoR

  5. In R, install echolocatoR:

if(!"devtools" %in% installed.packages()){install.packages("devtools")}
devtools::install_github("RajLabMSSM/echolocatoR")

To make sure echolocatoR uses the packages in this env (especially when running from RStudio), you can then supply the env name to the finemap_loci() function using conda_env="echoR".
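
Before passing conda_env="echoR", it can help to confirm that the env is actually visible from your R session. The sketch below is one possible way to do that, assuming the reticulate package (listed under Dependencies) is installed. `has_conda_env` is a hypothetical helper, not part of echolocatoR, and its `envs` argument exists only so the check can be exercised without conda.

```r
# Hypothetical helper: is a conda env with this name visible from R?
# `envs` may be supplied directly; otherwise the names are looked up with
# reticulate::conda_list(), which errors if no conda installation is found.
has_conda_env <- function(env = "echoR", envs = NULL) {
  if (is.null(envs)) {
    if (!requireNamespace("reticulate", quietly = TRUE)) {
      return(FALSE)
    }
    envs <- tryCatch(
      reticulate::conda_list()$name,
      error = function(e) character(0)  # no conda found: treat as "no envs"
    )
  }
  env %in% envs
}

if (!has_conda_env("echoR")) {
  message("conda env 'echoR' not found; create it from echoR.yml first.")
}
```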


Dependencies

For a full list of suggested packages, see DESCRIPTION.

* = optional

R

- magrittr  
- R.utils  
- dplyr  
- BiocManager 
- tidyverse
- knitr
- rmarkdown  
- data.table  
- foreign  
- reticulate  
- ggplot2    
- ggrepel  
- coloc    
- RColorBrewer   
- patchwork   
- htmltools  
- stringr    
- openxlsx  
- EnsDb.Hsapiens.v75    
- ensembldb   
- ggbio    
- BSgenome  
- Ckmeans.1d.dp  
- refGenome   

Python

- python>=3.6.1  
- pandas>=0.25.0   
- pandas-plink  
- pyarrow  
- fastparquet  
- scipy  
- scikit-learn  
- tqdm  
- bitarray  
- networkx  
- rpy2  
- requests  

Command line

  • tabix
    • Rapid querying of summary stats files.
    • To use it, specify query_by="tabix" in finemap_loci().
  • Used here for filtering populations in vcf files.
  • axel
    • Rapid multi-core downloading of large files (e.g. LD matrices from UK Biobank).
    • To use it, specify download_method="axel" in finemap_loci().
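
Since query_by="tabix" and download_method="axel" rely on external executables, a quick check that they are on your PATH can save a failed run. The following is a minimal sketch using base R only. `cli_available` is a hypothetical helper name, and the executable names "tabix" and "axel" are assumed to match the option values above.

```r
# Hypothetical helper: report which command line tools are found on the PATH.
# Sys.which() returns "" for tools it cannot locate.
cli_available <- function(tools = c("tabix", "axel")) {
  paths <- unname(Sys.which(tools))
  stats::setNames(paths != "", tools)
}

found <- cli_available()
for (tool in names(found)) {
  message(tool, ": ", if (found[[tool]]) "found" else "not found on PATH")
}
```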

Fine-mapping Tools

echolocatoR automatically checks whether you have the necessary columns to run each tool you selected in finemap_loci(finemap_methods=...). It removes any tool for which necessary columns are missing and prints a message telling you which columns are missing. Note that some columns (e.g. MAF,N,t-stat) can be automatically inferred if missing.
For easy reference, we list the necessary columns here as well.
See ?finemap_loci() for descriptions of these columns.
All methods require the columns: SNP,CHR,POS,Effect,StdErr

Additional required columns:

ABF: proportion_cases,MAF

FINEMAP: A1,A2,MAF,N

PolyFun: A1,A2,P,N

PAINTOR: A1,A2,t-stat

GCTA-COJO: A1,A2,Freq,P,N

coloc: N,MAF
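
To preview which tools your summary statistics can support, the column requirements listed above can be encoded in a small lookup. This is an illustrative sketch in base R, not echolocatoR's internal check: the helper names are hypothetical, the toy data frame is made up, and the sketch does not model automatic inference except for a t-stat computed as Effect / StdErr (the standard definition, assumed here rather than taken from the package).

```r
# Required columns, copied from the lists above.
core_cols <- c("SNP", "CHR", "POS", "Effect", "StdErr")
extra_cols <- list(
  "ABF"       = c("proportion_cases", "MAF"),
  "FINEMAP"   = c("A1", "A2", "MAF", "N"),
  "PolyFun"   = c("A1", "A2", "P", "N"),
  "PAINTOR"   = c("A1", "A2", "t-stat"),
  "GCTA-COJO" = c("A1", "A2", "Freq", "P", "N"),
  "coloc"     = c("N", "MAF")
)

# Hypothetical helper: columns each method still needs, given what is available.
missing_columns <- function(available, methods = names(extra_cols)) {
  lapply(stats::setNames(methods, methods), function(m) {
    setdiff(c(core_cols, extra_cols[[m]]), available)
  })
}

# Hypothetical helper: methods whose required columns are all present.
runnable_methods <- function(available, methods = names(extra_cols)) {
  miss <- missing_columns(available, methods)
  names(miss)[lengths(miss) == 0]
}

# Assumed inference rule: t-stat = Effect / StdErr when the column is absent.
add_tstat <- function(dat) {
  if (!"t-stat" %in% names(dat)) {
    dat[["t-stat"]] <- dat$Effect / dat$StdErr
  }
  dat
}

# Toy one-row example (values are made up).
dat <- data.frame(SNP = "rs1", CHR = 1, POS = 100, Effect = 0.2,
                  StdErr = 0.1, A1 = "A", A2 = "G", check.names = FALSE)
runnable_methods(names(dat))             # before inferring t-stat
runnable_methods(names(add_tstat(dat)))  # after inferring t-stat
```

`missing_columns(names(dat))` gives the per-method list of absent columns, which mirrors the kind of message the pipeline prints.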


Datasets

For more detailed information about each dataset, use ?:

library(echolocatoR)
?NOTT_2019.interactome # example dataset

Epigenomic & Genome-wide Annotations

  • Data from this publication contains results from cell type-specific (neurons, oligodendrocytes, astrocytes, microglia, & peripheral myeloid cells) epigenomic assays (H3K27ac, ATAC, H3K4me3) from human brain tissue.

  • For detailed metadata, see:

    data("NOTT_2019.bigwig_metadata")
  • Built-in datasets:

    • Enhancer/promoter coordinates (as GenomicRanges)
    data("NOTT_2019.interactome")
    # Examples of the data nested in "NOTT_2019.interactome" object:
    NOTT_2019.interactome$`Neuronal promoters`
    NOTT_2019.interactome$`Neuronal enhancers`
    NOTT_2019.interactome$`Microglia promoters`
    NOTT_2019.interactome$`Microglia enhancers`
    ...
    ...
    • PLAC-seq enhancer-promoter interactome coordinates
    NOTT_2019.interactome$H3K4me3_around_TSS_annotated_pe
    NOTT_2019.interactome$`Microglia interactome`
    NOTT_2019.interactome$`Neuronal interactome`
    NOTT_2019.interactome$`Oligo interactome`
    ...
    ...
  • API access to full bigWig files on UCSC Genome Browser, which includes

    • Epigenomic reads (as GenomicRanges)
    • Aggregate epigenomic score for each cell type - assay combination
  • Data from this preprint contains results from bulk and single-cell chromatin accessibility epigenomic assays in 39 human brains.
    data("CORCES_2020.bulkATACseq_peaks")
    data("CORCES_2020.cicero_coaccessibility")
    data("CORCES_2020.HiChIP_FitHiChIP_loop_calls")
    data("CORCES_2020.scATACseq_celltype_peaks")
    data("CORCES_2020.scATACseq_peaks")
  • API access to a diverse library of cell type/line-specific epigenomic (e.g. ENCODE) and other genome-wide annotations.
  • API access to cell type-specific epigenomic data.
  • API access to various genome-wide SNP annotations (e.g. missense, nonsynonymous, intronic, enhancer).
  • API access to known per-SNP QTL and epigenomic data hits.
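
The built-in datasets named in the data(...) calls above can also be loaded in one go. The sketch below shows one way to do that with base R's utils::data(). `load_builtin` is a hypothetical helper, and it simply returns an empty environment when echolocatoR is not installed.

```r
# Dataset names copied from the data(...) calls above.
corces_datasets <- c(
  "CORCES_2020.bulkATACseq_peaks",
  "CORCES_2020.cicero_coaccessibility",
  "CORCES_2020.HiChIP_FitHiChIP_loop_calls",
  "CORCES_2020.scATACseq_celltype_peaks",
  "CORCES_2020.scATACseq_peaks"
)

# Hypothetical helper: load the named datasets into a fresh environment so
# they do not clutter the global workspace.
load_builtin <- function(dataset_names, pkg = "echolocatoR") {
  env <- new.env()
  if (requireNamespace(pkg, quietly = TRUE)) {
    # Warnings about names that are not found are suppressed; check ls(env).
    suppressWarnings(
      utils::data(list = dataset_names, package = pkg, envir = env)
    )
  }
  env
}

loaded <- load_builtin(corces_datasets)
ls(loaded)  # names of the objects that were actually loaded
```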

QTLs

  • API access to full summary statistics from many standardized e/s/t-QTL datasets.
  • Data access and colocalization tests facilitated through the catalogueR R package.

Enrichment Tools

  • Binomial enrichment tests between customisable foreground and background SNPs.
  • LD-informed iterative enrichment analysis.
  • Genome-wide stratified LD score regression.
  • Includes the 187-annotation baseline model from Gazal et al. 2018.
  • You can alternatively supply a custom annotations matrix.
  • Identification of transcription factor binding motifs (TFBM) and prediction of SNP disruption to said motifs.
  • Includes a comprehensive list of TFBM databases via MotifDb (9,900+ annotated position frequency matrices from 14 public sources, for multiple organisms).
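
To make the idea of a binomial enrichment test between foreground and background SNPs concrete, here is a minimal sketch using base R's binom.test(). It illustrates the general technique only and is not necessarily how echolocatoR implements it. `binomial_enrichment` is a hypothetical helper and the overlap counts are made-up toy numbers.

```r
# Toy data (made up): TRUE means the SNP overlaps the annotation of interest.
foreground <- rep(c(TRUE, FALSE), times = c(30, 70))      # e.g. fine-mapped SNPs
background <- rep(c(TRUE, FALSE), times = c(1000, 9000))  # e.g. all tested SNPs

# Hypothetical helper: one-sided binomial test asking whether the foreground
# overlap rate exceeds the rate observed in the background.
binomial_enrichment <- function(foreground, background) {
  p_bg <- mean(background)
  test <- stats::binom.test(
    x = sum(foreground),
    n = length(foreground),
    p = p_bg,
    alternative = "greater"
  )
  list(
    foreground_rate = mean(foreground),
    background_rate = p_bg,
    fold_enrichment = mean(foreground) / p_bg,
    p_value = test$p.value
  )
}

res <- binomial_enrichment(foreground, background)
str(res)
```

Swapping in different logical vectors for `foreground` and `background` is what makes the two SNP sets customisable in this sketch.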

GARFIELD (under construction)

  • Genomic enrichment with LD-informed heuristics.

LD Reference Panels



Author

Brian M. Schilder, Bioinformatician II
Raj Lab
Department of Neuroscience, Icahn School of Medicine at Mount Sinai