
Distinguishing between generic and experiment-specific gene expression signals.

Generic transcriptional responses revealed using SOPHIE: Specific cOntext Pattern Highlighting In Expression data

Alexandra J. Lee, Dallas L. Mould, Jake Crawford, Dongbo Hu, Rani K. Powers, Georgia Doing, James C. Costello, Deborah A. Hogan, Casey S. Greene

University of Pennsylvania, University of Colorado Anschutz Medical Campus, Dartmouth College

Some genes and pathways are differentially expressed across many gene expression experiments (Powers et al., Bioinformatics 2018; Crow et al., PNAS 2019). These generic findings can obscure results that are specific to the context or experiment of interest, which are often what we hope to glean when using gene expression to generate mechanistic insights into cellular states and diseases. Current methods for identifying generic signals, including those of Powers et al. and Crow et al., rely on the manual curation and identical analysis of hundreds or thousands of additional experiments, which is inordinately time-consuming and not a practical step in most analytical workflows. If you want to perform a new differential expression (DE) analysis in a different biological context (e.g. a different organism, tissue, or media), you might not have the curated data available: switching contexts requires re-curation, as does using a different statistical method.

We introduce a new approach to identify generic patterns that uses generative neural networks to produce a null or background set of transcriptomic experiments. Analyzing a target experiment against this automatically generated background set makes it straightforward to separate generic and specific results. This approach, called SOPHIE for Specific cOntext Pattern Highlighting In Expression data, can be applied to any new platform or species for which there is a large collection of unlabeled gene expression data. Here, we apply SOPHIE to the analysis of both human and bacterial datasets, and use this method to highlight the ability to detect highly specific but low magnitude transcriptional signals that are biologically relevant. The reusable notebooks for training neural networks and for the use of pre-trained generative models for the analysis of differential expression experiments may be broadly useful for the prioritization of specific findings in complex datasets.
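
To make the comparison against the background set concrete, here is a minimal sketch rather than the repository's implementation. It assumes that each simulated experiment yields one log fold change per gene, and it scores each gene in the target experiment by how far its value lies from the simulated distribution. The function name `specificity_scores`, the z-score summary, and the toy numbers are all illustrative assumptions.

```python
import statistics


def specificity_scores(target_logfc, simulated_logfc):
    """Score each gene by how unusual its target change is versus the background.

    target_logfc: dict mapping gene -> log fold change in the target experiment.
    simulated_logfc: dict mapping gene -> list of log fold changes, one per
        simulated background experiment.
    Returns a dict mapping gene -> z-score (a hypothetical summary statistic).
    """
    scores = {}
    for gene, observed in target_logfc.items():
        background = simulated_logfc[gene]
        mean = statistics.mean(background)
        spread = statistics.stdev(background)
        # A gene that changes in most background experiments has a background
        # mean close to its target value, so its score stays small (generic).
        # A gene that is quiet in the background but changes in the target
        # experiment gets a large score (specific).
        scores[gene] = (observed - mean) / spread if spread > 0 else 0.0
    return scores


# Toy data, made up for illustration only.
target = {"geneA": 2.0, "geneB": 0.8}
background = {
    "geneA": [1.8, 2.2, 1.9, 2.1],    # large change in every experiment
    "geneB": [0.0, 0.1, -0.1, 0.05],  # small change seen only in the target
}
scores = specificity_scores(target, background)
```

In this toy case `geneB` has the smaller fold change but the larger score, which mirrors the kind of low magnitude but specific signal described above.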

This method is named after one of the main characters in Hayao Miyazaki's animated film Howl's Moving Castle. Sophie outwardly appears to be an old woman, although she is actually a young woman who has been cursed, which demonstrates that the most obvious thing you see isn't always the truth. This is the idea behind our approach, which allows users to identify specific gene expression signatures that can be masked by generic background patterns.

SOPHIE trains a multi-layer variational autoencoder (VAE) on a gene expression compendium. New experiments are then simulated by linearly shifting the selected template experiment (i.e. a real experiment selected from the training compendium or from an external source) to a new location in the latent space. This new location is randomly sampled from the distribution of the low-dimensional representation of the training compendium. The vector that connects the template experiment and the new location is added to the template experiment to create a new simulated experiment. This process is repeated multiple times to create multiple simulated experiments based on the single template experiment.
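
The shifting step can be sketched as follows. This is an illustration under stated assumptions, not the code used in this repository: a fixed linear projection stands in for the trained VAE encoder and decoder, the shift is applied to the encoded samples before decoding, and the new location is drawn from a per-dimension normal distribution fitted to the encoded compendium. The names `encode`, `decode`, and `simulate_experiment` are hypothetical.

```python
import random
import statistics


def encode(sample, weights):
    """Stand-in for the trained VAE encoder: a plain linear projection."""
    return [sum(w * x for w, x in zip(row, sample)) for row in weights]


def decode(latent, weights):
    """Stand-in for the trained VAE decoder: the transposed projection."""
    n_genes = len(weights[0])
    return [
        sum(weights[k][g] * latent[k] for k in range(len(latent)))
        for g in range(n_genes)
    ]


def simulate_experiment(template, compendium_latent, weights, rng):
    """Shift a template experiment to a randomly sampled latent location."""
    encoded = [encode(sample, weights) for sample in template]
    n_latent = len(weights)
    # Current location of the template experiment: its latent-space centroid.
    centroid = [
        statistics.mean([z[k] for z in encoded]) for k in range(n_latent)
    ]
    # New location: sampled per dimension from a normal distribution fitted
    # to the latent representation of the compendium (an assumption).
    new_location = []
    for k in range(n_latent):
        values = [z[k] for z in compendium_latent]
        new_location.append(
            rng.gauss(statistics.mean(values), statistics.stdev(values))
        )
    # Vector connecting the template experiment to the new location.
    shift = [new - old for new, old in zip(new_location, centroid)]
    # Adding the same vector to every sample keeps the within-experiment
    # structure intact while moving the experiment as a whole.
    shifted = [[z_k + s_k for z_k, s_k in zip(z, shift)] for z in encoded]
    simulated = [decode(z, weights) for z in shifted]
    return shifted, simulated


# Toy setup: 3 genes, 2 latent dimensions, 50 compendium samples.
rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
compendium = [[rng.random() for _ in range(3)] for _ in range(50)]
compendium_latent = [encode(sample, weights) for sample in compendium]
template = compendium[:4]  # treat the first 4 samples as one experiment

num_simulated = 5
simulated_experiments = [
    simulate_experiment(template, compendium_latent, weights, rng)[1]
    for _ in range(num_simulated)
]
```

Each simulated experiment has the same number of samples as the template, and the differences between samples in the latent space are unchanged by the shift.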

Directory Structure

Folder Description
LV_analysis This folder contains analysis notebooks to examine the potential role of generic genes by looking at the coverage of generic genes across PLIER latent variables or eADAGE latent variables, which are associated with known biological pathways.
compare_experiments This folder contains analysis notebooks to compare multiple SOPHIE results using the same template experiment and different template experiments. This analysis tests the robustness of SOPHIE results.
configs This folder contains configuration files used to set hyperparameters for the different experiments
explore_RNAseq_only_generic_genes This folder contains analysis notebooks testing different hypotheses to explain the subset of genes found to be generic by SOPHIE trained on RNA-seq data but not found to be generic in the manually curated array dataset.
explore_data This folder contains an analysis notebook visualizing the recount2 dataset to get a sense for the variation contained.
figure_generation This folder contains a notebook to generate the figures seen in the manuscript.
generic_expression_patterns_modules This folder contains supporting functions that other notebooks in this repository will use.
human_cancer_analysis This folder contains analysis notebooks to validate generic signals using the Powers et al. dataset, which is composed of experiments testing the response of cancer cell lines to small molecule treatments, to train the VAE.
human_general_analysis This folder contains analysis notebooks to validate generic signals using the recount2 dataset, which contains a heterogeneous set of experiments, to train the VAE.
network_analysis This folder contains analysis notebooks to examine the potential role of generic genes by looking at the clustering of generic genes within network communities.
new_experiment This folder contains analysis notebooks to identify specific and generic signals using a new experiment and an existing VAE model
other_enrichment_methods This folder contains analysis notebooks to apply different gene set enrichment methods. The default method used is GSEA.
pseudomonas_analysis This folder contains analysis notebooks to identify specific and generic signals using a P. aeruginosa dataset to train the VAE.
tests This folder contains notebooks to test the code in this repository. These notebooks run a small dataset across the analysis notebooks found in the human_general_analysis directory.


How to run notebooks from generic-expression-patterns

Operating Systems: macOS, Linux (Note: bioconda libraries are not available on Windows)

To run this simulation on your own gene expression data, perform the following steps.

First, set up your local repository:

  1. Download and install GitHub's large file tracker (Git LFS).
  2. Install miniconda.
  3. Clone the generic-expression-patterns repository by running the following command in the terminal:
     git clone https://github.com/greenelab/generic-expression-patterns.git
     Note: Git automatically detects the LFS-tracked files and clones them via HTTP.
  4. Navigate into the cloned repo by running the following command in the terminal:
     cd generic-expression-patterns
  5. Set up the conda environment by running the following command in the terminal:
     bash install.sh
  6. Navigate to either the pseudomonas_analysis, human_general_analysis or human_cancer_analysis directory and run the notebooks in order.

Note: The human_general_analysis/1_process_recount2_data.ipynb notebook can take several days to run because the dataset is very large. If you would like to run only the analysis notebook (human_general_analysis/2_identify_generic_genes_pathways.ipynb) to generate the human analysis results found in the publication, you can update the config file to use the following file locations:

  • The normalized compendium data used for the analysis in the publication can be found here.
  • The Hallmark pathway database can be found here
  • The processed template file can be found here
  • The scaler file can be found here

How to analyze your own data using existing models

To run this simulation on your own gene expression data, note the following requirements and available models:

  • Your input dataset should be a matrix that is sample x gene
  • The gene ids should be HGNC symbols (if using human data) or PA numbers (if using P. aeruginosa data)
  • Your input dataset should be generated using the same platform as the model you plan to use (i.e. RNA-seq or array)
  • Models available to use are: recount2 (human RNA-seq model found in human_general_analysis/models), Powers et al. (human array model found in human_cancer_analysis/models), P. aeruginosa (P. aeruginosa array model found in pseudomonas_analysis/models)
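
The first two requirements can be checked before running any notebook. The sketch below is not part of this repository. It assumes a tab-separated file whose header row holds the gene ids and whose first column holds the sample ids, and it assumes PA numbers look like `PA0001` (the letters PA followed by four digits). The function name `check_input_matrix` and the organism label are hypothetical.

```python
import csv
import io
import re

# Assumed pattern for P. aeruginosa PA numbers, e.g. PA0001.
PA_NUMBER = re.compile(r"^PA\d{4}$")


def check_input_matrix(handle, organism):
    """Run basic sanity checks on a sample x gene expression table.

    Assumes a tab-separated layout: the header row holds gene ids and the
    first column of every later row holds a sample id.
    Returns a list of human-readable problems (empty if none were found).
    """
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, samples = rows[0], rows[1:]
    gene_ids = header[1:]
    problems = []
    if organism == "pseudomonas":
        bad = [gene for gene in gene_ids if not PA_NUMBER.match(gene)]
        if bad:
            # Often a sign that the matrix is gene x sample and needs
            # to be transposed.
            problems.append(f"gene ids that are not PA numbers: {bad}")
    for row in samples:
        if len(row) != len(header):
            problems.append(
                f"sample {row[0]} has {len(row) - 1} values, "
                f"expected {len(gene_ids)}"
            )
            continue
        try:
            [float(value) for value in row[1:]]
        except ValueError:
            problems.append(f"sample {row[0]} has non-numeric values")
    return problems


# Toy tables, made up for illustration only.
sample_by_gene = "\tPA0001\tPA0002\nsample1\t5.1\t7.3\nsample2\t4.8\t6.9\n"
gene_by_sample = "\tsample1\tsample2\nPA0001\t5.1\t4.8\nPA0002\t7.3\t6.9\n"

ok_report = check_input_matrix(io.StringIO(sample_by_gene), "pseudomonas")
bad_report = check_input_matrix(io.StringIO(gene_by_sample), "pseudomonas")
```

HGNC symbols have no single fixed pattern, so this sketch only checks gene ids for P. aeruginosa data. Human gene ids would need to be compared against a symbol list instead.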

The table below lists the parameters required to run the analysis in this repository. These will need to be updated to run your own analysis. A * marks optional parameters, which are needed only if you are comparing the ranks of your genes/gene sets with some reference ranking. A ** marks a parameter that is needed only when using get_recount2_sra_subset (in download_recount2_data.R).

Note: Some of these parameters are required by the imported ponyo modules.

Name Description
local_dir str: Parent directory on local machine to store intermediate results.
dataset_name str: Name for analysis directory, which contains the notebooks being run. For our analysis it is named "human_analysis".
raw_template_filename str: Downloaded template gene expression data file
mapped_template_filename str: Template gene expression data file after replacing gene ids in header. This is an intermediate file that gets generated.
processed_template_filename str: Template gene expression data file after removing samples and genes. This is an intermediate file that gets generated.
raw_compendium_filename str: Downloaded compendium gene expression data file
mapped_compendium_filename str: Compendium gene expression data file after replacing gene ids in header. This is an intermediate file that gets generated.
normalized_compendium_filename str: Normalized compendium gene expression data file. This is an intermediate file that gets generated.
shared_genes_filename str: Pickle file on your local machine in which to store the genes that will be examined. These genes are the intersection of the genes in your dataset and those in a reference, which ensures that there are no NaNs in downstream analysis. This is an intermediate file that gets generated.
scaler_filename str: Pickle file on your local machine in which to store the normalization transform used to process data for visualization. This is an intermediate file that gets generated.
reference_gene_filename* str: File that contains reference genes and their rank. Note that the values assigned to genes need to be ranks.
reference_gene_name_col str: Name of the column header that contains the reference genes. This is found in reference_gene_filename*
reference_rank_col str: Name of the column header that contains the reference gene ranks. This is found in reference_gene_filename*
rank_genes_by str: Name of column header from DE association statistic results. This column will be used to rank genes. Select logFC, P.Value, adj.P.Val, t if using Limma. Select log2FoldChange, pvalue, padj if using DESeq.
DE_logFC_name str: "logFC" or "log2FoldChange". This is used for plotting volcano plots
DE_pvalue_name str: "adj.P.Val" or "padj". This is used for plotting volcano plots
pathway_DB_filename* str: File that contains pathways to use for GSEA
gsea_statistic str: Statistic to use to rank genes for GSEA analysis. Select logFC, P.Value, adj.P.Val, t if using Limma. Select log2FoldChange, pvalue, padj if using DESeq.
rank_pathways_by str: Name of column header from GSEA association statistic results. This column will be used to rank pathways. Select NES, padj if using DESeq to rank genes.
NN_architecture str: Name of neural network architecture to use. Format 'NN__'
learning_rate float: Step size used for gradient descent. In other words, it's how quickly the method learns
batch_size int: Training is performed in batches, so this determines the number of samples to consider at a given time
epochs int: Number of times to train over the entire input dataset
kappa float: How fast to linearly ramp up KL loss
intermediate_dim int: Size of the hidden layer
latent_dim int: Size of the bottleneck layer
epsilon_std float: Standard deviation of Normal distribution to sample latent space
validation_frac float: Fraction of samples to use for validation in VAE training
project_id str: Experiment id to use as a template experiment
count_threshold int: Minimum count threshold to use to filter RNA-seq data. Default is None
metadata_colname str: Header of experiment metadata file to indicate column containing sample ids. This is used to extract gene expression data associated with project_id
num_simulated int: Simulate a compendium with this many experiments, created by shifting the template experiment this many times
num_recount2_experiments_to_download** int: Number of recount2 experiments to download. Note this will not be needed when we update the training to use all of recount2
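
Because several of these parameters depend on one another, it can help to check a set of values before starting a long run. The sketch below is a hypothetical helper, not something provided by this repository or by ponyo. It covers only a subset of the parameters in the table, and the example values are made up. The allowed column names are taken from the rank_genes_by description above.

```python
# Column names listed in the table for each DE method.
LIMMA_COLUMNS = {"logFC", "P.Value", "adj.P.Val", "t"}
DESEQ_COLUMNS = {"log2FoldChange", "pvalue", "padj"}

# Subset of the parameters above with the types the table gives for them.
EXPECTED_TYPES = {
    "local_dir": str,
    "dataset_name": str,
    "rank_genes_by": str,
    "learning_rate": float,
    "epochs": int,
    "latent_dim": int,
    "validation_frac": float,
    "num_simulated": int,
}


def validate_params(params, de_method):
    """Return a list of problems found in a parameter dict (empty if none).

    de_method is a hypothetical label, either "limma" or "deseq".
    """
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in params:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(params[name], expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    # The ranking column has to exist in the output of the DE method used.
    allowed = LIMMA_COLUMNS if de_method == "limma" else DESEQ_COLUMNS
    if "rank_genes_by" in params and params["rank_genes_by"] not in allowed:
        problems.append(
            f"rank_genes_by must be one of {sorted(allowed)} for {de_method}"
        )
    # A validation fraction only makes sense strictly between 0 and 1.
    frac = params.get("validation_frac")
    if isinstance(frac, float) and not 0.0 < frac < 1.0:
        problems.append("validation_frac must be between 0 and 1")
    return problems


# Example values, made up for illustration only.
params = {
    "local_dir": "/tmp/sophie",
    "dataset_name": "human_analysis",
    "rank_genes_by": "log2FoldChange",
    "learning_rate": 0.001,
    "epochs": 10,
    "latent_dim": 30,
    "validation_frac": 0.25,
    "num_simulated": 25,
}
deseq_report = validate_params(params, "deseq")
limma_report = validate_params(params, "limma")
```

With these values the DESeq check passes, while the Limma check flags rank_genes_by, since log2FoldChange is not a Limma column name.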


