SATORIv2 : Comprehensive Evaluation of SATORI

SATORI v2 is based on Self-ATtentiOn based deep learning model that captures Regulatory element Interactions in genomic sequences. It can be used to infer a global landscape of interactions in a given genomic dataset, with a minimal post-processing step. This repository contains code for extensive evaluation of self-attention layer in order to predict feature interactions.

Original Manuscript

Fahad Ullah, Asa Ben-Hur, A self-attention model for inferring cooperativity between regulatory features, Nucleic Acids Research, 2021;, gkab349, https://doi.org/10.1093/nar/gkab349

Dependency

SATORI V2 is written in python 3. The following python packages are required:
biopython (version 1.75)
captum (version 0.2.0)
fastprogress (version 0.1.21)
matplotlib (vresion 3.1.3)
numpy (version 1.17.2)
pandas (version 0.25.1)
pytorch (version 1.2.0)
scikit-learn (vresion 0.24)
scipy (version 1.4.1)
seaborn (version 0.9.0)
statsmodels (version 0.9.0)

and for motif analysis:
MEME suite
WebLogo

Installation

Download SATORI (via git clone):

git clone git@github.com:sairajab/satoriv2.git satori

Navigate to the cloned directory:

cd satori

Install SATORI:

python setup.py install

Make the main script (satori.py) executable:

chmod +x satori.py

(Optional) To execute the script everywhere, update the PATH and PYTHONPATH environment variables:

export PATH=path-to-satori:$PATH
export PYTHONPATH=path-to-satori/satori:$PYTHONPATH

Usage

usage: satori.py [-h] [-v] [-o DIRECTORY] [-m MODE] [--deskload] [-w NUMWORKERS]
                 [--splitperc SPLITPERC] [--motifanalysis] [--filtersanalysis]
                 [--scorecutoff SCORECUTOFF] [--tomtompath TOMTOMPATH] [--database TFDATABASE]
                 [--annotate ANNOTATETOMTOM] [-i] [--interactionanalysis] [-b INTBACKGROUND]
                 [--attncutoff ATTNCUTOFF] [--fiscutoff FISCUTOFF] [--intseqlimit INTSEQLIMIT] [-s]
                 [--numlabels NUMLABELS] [--tomtomdist TOMTOMDIST] [--tomtompval TOMTOMPVAL]
                 [--testall] [--useall] [--precisionlimit PRECISIONLIMIT] [--attrbatchsize ATTRBATCHSIZE]
                 [--method METHODTYPE] [--gt_pairs PAIRS_FILE] [--finetune_model FINETUNE_MODEL_PATH]
                 [--set_seed] [--seed SEED] [--motifweights]
                 inputprefix hparamfile



Main SATORI script.

positional arguments:
  inputprefix           Input file prefix for the bed/text file and the
                        corresponding fasta file (sequences).
  hparamfile            Name of the hyperparameters file to be used.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         verbose output [default is quiet running]
  -o DIRECTORY, --outDir DIRECTORY
                        output directory
  -m MODE, --mode MODE  Mode of operation: train or test.
  --deskload            Load dataset from desk. If false, the data is converted into tensors and kept in main memory (not recommended for large datasets).
  -w NUMWORKERS, --numworkers NUMWORKERS
                        Number of workers used in data loader. For loading from the desk, use more than 1 for faster fetching.
  --splitperc SPLITPERC
                        Pecentages of test, and validation data splits, eg. 10 for 10 percent data used for testing and validation.
  --motifanalysis       Analyze CNN filters for motifs and search them against known TF database.
  --filtersanalysis     Analyze CNN filters for motifs based on annotation file.
  --scorecutoff SCORECUTOFF
                        In case of binary labels, the positive probability cutoff to use.
  --tomtompath TOMTOMPATH
                        Provide path to where TomTom (from MEME suite) is located.
  --database TFDATABASE
                        Search CNN motifs against known TF database. Default is Human CISBP TFs.
  --annotate ANNOTATETOMTOM
                        Annotate tomtom motifs. The value of this variable should be path to the database file used for annotation. Default is None.
  -i, --interactions    Self attention based feature(TF) interactions analysis.
  --interactionanalysis
                        interactions analysis with ground truth interactions
  -b INTBACKGROUND, --background INTBACKGROUND
                        Background used in interaction analysis: shuffle (for di-nucleotide shuffled sequences with embedded motifs.), negative (for negative test set). Default is not to use background (and
                        significance test).
  --attncutoff ATTNCUTOFF
                        Attention cutoff value. For a given interaction, it should have an attention value at least as high as this value across all examples.
  --fiscutoff FISCUTOFF
                        FIS score cutoff value. For a given interaction, it should have an FIS score at least as high as this value across all examples.
  --intseqlimit INTSEQLIMIT
                        A limit on number of input sequences to test. Default is -1 (use all input sequences that qualify).
  -s, --store           Store per batch attention and CNN outpout matrices. If false, the are kept in the main memory.
  --numlabels NUMLABELS
                        Number of labels. 2 for binary (default). For multi-class, multi label problem, can be more than 2.
  --tomtomdist TOMTOMDIST
                        TomTom distance parameter (pearson, kullback, ed etc). Default is euclidean (ed). See TomTom help from MEME suite.
  --tomtompval TOMTOMPVAL
                        Adjusted p-value cutoff from TomTom. Default is 0.05.
  --testall             Test on the entire dataset (default False). Useful for interaction/motif analysis.
  --useall              Use all examples in multi-label problem instead of using precision based example selection. Default is False.
  --precisionlimit PRECISIONLIMIT
                        Precision limit to use for selecting examples in case of multi-label problem.
  --attrbatchsize ATTRBATCHSIZE
                        Batch size used while calculating attributes for FIS scoring. Default is 12.
  --method METHODTYPE   Interaction scoring method to use; options are: SATORI, FIS, or BOTH. Default is SATORI.
  --gt_pairs PAIRS_FILE
                        Path to groud truth pairs file
  --finetune_model FINETUNE_MODEL_PATH
                        Path to the pre-trained model
  --set_seed            Set seed or not
  --seed SEED           Seed to intialize model
  --motifweights        Load weights of first CNN from motifs PWM and freeze them, by default false hence weights are randomly intialized and trained.).

Tutorial

Simulated Data Generation

Jaspar.meme file is required to load motif PWMs. Number of examples, data type and paths can be modified in generate_data.py .

cd create_dataset
python generate_data.py

Example: binary classification

For simulated data experiments:
Training for Data-40 using three seed values and for both BASIC and DEEP models.

python run_experiments.py

Individual experiment

satori.py data/Simulated_Data/Data-40/ctf_40pairs_eq0 modelsparam/all_exps/simulated/basic/baseline_entropy_0.005.txt --outDir results/Data-40/ctf_40pairs_eq0/baseline_entropy_0.005/E1/ --mode train -v -s --numlabels 2 --attrbatchsize 32 --set_seed --seed 0 --deskload --intseqlimit 5000 --motifanalysis --interactions --interactionanalysis --background negative --method SATORI --tomtompath PATH-TO-TOMTOM-TOOL --database create_dataset/subset40.meme --gt_pairs create_dataset/tf_pairs_40.txt

Example: multi-label classification

For the arabidopsis genomewide chromatin accessibility dataset:

satori.py data/Arabidopsis_ChromAccessibility/atAll_m200_s600 modelsparam/arabidopsis/arabidopsis_deep_entropy.txt -w 8 --outDir results/Arabidopsis_GenomeWide_Analysis --mode train -v -s --background shuffle --intseqlimit 5000 --numlabels 36 --motifanalysis --interactions --method SATORI --attrbatchsize 32 --deskload --tomtompath PATH-TO-TOMTOM-TOOL --database PATH-TO-MEME-TF-DATABASE

For the human promoters chromatin accessibility dataset:

satori.py data/Human_Promoters/encode_roadmap_inPromoter modelsparam/all_exps/human_promoters/human_promoter_deep_entropy.txt -w 8 --outDir results/Human_Promoters_Analysis/ --mode train -v -s --background shuffle --intseqlimit 5000 --numlabels 164 --motifanalysis --interactions --method SATORI --attrbatchsize 32 --deskload --tomtompath PATH-TO-TOMTOM-TOOL --database PATH-TO-MEME-TF-DATABASE

Note: make sure to specify path to the TomTom tool and the corresponding motif database.
PATH-TO-TOMTOM-TOOL path to TomTom tool in the MEME suite.
PATH-TO-MEME-TF-DATABASE path to the TF database to use (MEME suite comes with different databases).

Post-processing

The resutls are processed in separate Jupyter notebooks in the analysis directory. The notebooks assume that the results are in results folder, at the root (top level) directory of the repository.

Chip-seq Overlap Analysis

For Chip-seq overlap analysis, LOLA R-package has been used. chipseq_analysis/overlap_analysis.py takes unique interactions identified by the SATORI (paths can be modified inside the file) as input and downloads respective ChipSeq data from Chiphub and later performs the analysis.