Brain Integrative Transcriptome Hub (BITHub) is a web resource that aggregates gene-expression data from the human brain across multiple consortia, and allows direct comparison of gene expression in an interactive manner.
Table of Contents:
All scripts for pre-processing data are in the R/preprocess
folder. Please read the R/preprocess/README.md
for more information on how to use the script to pre-process files.
Both processed bulk and single-nucleus RNA-seq human brain transcriptomic datasets were retrieved from their respective portals as highlighted in Table 1.
Dataset | Description | nSamples | Original file |
---|---|---|---|
BrainSeq | RNA-seq data of the human postmortem brain including hippocampus and dorsolateral prefrontal cortex. Collado-Torres et al used RiboZero libraries on 900 tissue samples from 551 individuals (including 286 with schizophrenia). Prenatal (age < 0; range, 14 to 22 post-conception weeks) and postnatal (age ≥ 18 years; range, 18 to 96 years) samples were used in this work. Resource: BrainSeq Phase II |
900 | Expression matrix and metadata |
BrainSpan | Samples collected and analyzed by Kang et al across multiple brain structures including 11 neocortical areas, cerebellar cortex, mediodorsal nucleus of the thalamus, striatum, amygdala, and hippocampus. Samples included prenatal (age < 0; range, 8 to 38 post-conception weeks) and postnatal (age ≥ 4 mos ; range, 4 mos to 41 years) phenotypes of the normal human brain | 524 | BrainSpan Developmental Atlas Expression matrix and metadata Additional metadata information Allen Brain Atlas mRIN: Feng et al (2015) |
GTEx | The Genotype-Tissue Expression database contains 2,642 samples of the human postmortem brain in postnatal ages (age <20; range 20 to 79 years across 13 brain regions. All samples have been collected from non-diseased individuals | 2642 | GTEx v8 Gene TPMs Metadata files Phenotype Attributes Sample Attributes |
PsychEncode | The PsychEncode dataset contains data of the dorsolateral prefrontal cortex from human postmortem tissues from prenatal (age <0; range 4 to 40 pcw) and postnatal samples. Samples include controls and individuals with diagnosis of Bipolar Disorder, ASD, Schizophrenia and Affective disorder. | 1866 | PsychEncode Resource Expression matrix Metadata file Access from Synapse |
Human Cell Atlas | Content Cell | 32,749 | Content Cell |
Velmeshev et al | Velmeshev et al generated single-nuclei from 48 post-mortem tissue samples from the prefrontal cortex, anterior cingulate and insular cortical regions. Donors included 16 control subjects and 11 patients with ASD. All samples are postnatal | 81,216 | Cells UCSC Matrix: exprMatrix.tsv.gz Values in matrix are: 10x UMI counts from cellranger, log2-transformed Raw count matrix: rawMatrix.zip |
The user also has the option to upload their own datasets which are then integrated in BITHub in the same manner as the five core datasets. However, these data do not persist in BITHub, they are only available for the session in which they are uploaded.
As the metadata annotation was heterogeneous across the datasets, rigorous harmonization was performed. For each dataset, columns specifying Age Intervals
, Regions
, Diagnosis
and Period
were also added.
Developmental Ages
Samples were binned into age intervals that were used to define developmental stages. For all samples < 20 years old, the binning was performed based on the BrainSpan Technical White Paper (Kang el al, 2011), whereas samples 20 years were binned in 10 year intervals.
To allow comparison on a consistent scale, all ages were converted to years (numeric age). For prenatal ages (labeled -pcw):
Numeric age = -(40 - pcw) 52
where pcw is the age in post-conception weeks, 40 is the total number of prenatal weeks, and 52 denotes the total number of weeks in a year.
For ages labelled in months (labelled mos):
Numeric age = mos / 12
where 12 represents the total number of months in a year.
Prenatal or postnatal tags were assigned to samples depending on their numeric age where numeric age < 0 was labeled as prenatal and numeric age < 0 as postnatal.
Ontology and nomenclature of brain regions
The brain structures were divided into 4 main categories (regions): Cortex
, Subcortex
, Cerebellum
and Spinal Cord
.
variancePartition
was used for mixed linear analysis to estimate the proportion of variance explained by the selected covariates on each gene. Highly correlated covariates cannot be included in the model, and so as a result covariates that were not strongly correlated for each dataset were selected to the the varianceParititon analysis on filtered genes. For filtering, the expression cut-off was selected at 1 RPKM/TPM/CPM in at least 10% of the samples
To allow direct comparison of datasets with different normalizations, datasets have z-score transformed mean log2 expression values
.
BITHub implements multiple functionalities and can generate z-score distribution of genes in multiple datasets for cross-comparison. Additionally, users can investigate expression properties of specific genes against multiple metadata annotations, including technical, biological and sample specific variables.
To compare the expression of a given gene or gene sets across different datasets, use the quick search bar in the homepage by either entering the gene symbol or Ensembl IDs. Comma separated values (.csv
) containing a list of genes can also be uploaded and searched for. The search page also provides the user the option to upload a new dataset to BITHub.
Once a query has been sent to the interface, the user will be directed to the Search page with the results. The gene or genes of interest will be shown in a table with their Ensembl ID, Gene Symbol and a heatmap. The heatmap denotes the relative expression of the gene amongst datasets and if that gene is present in the given dataset. The user can then navigate directly to the corresponding gene page and explore its expression properties for each dataset. Search results can be downloaded as. .csv file.
Users can explore the following properties on BITHub:
- Detection in Brain Datasets
- Gene expression across datasets for single gene
The panel of the search results show a scatter plot with z-score transformed mean of every gene in the dataset. X-axis shows dataset 1, and the y-axis shows dataset 2. The dataset of interest can be selected using the drop down menu on the right. The gene or genes of interest are highlighted in green. To allow the direct comparison of gene expression across different datasets, we have provided a scatterplot listing z-score log2 mean transformed values of gene expression. This plot shows all genes in a given dataset with the gene of interest highlighted in green. Users can use this plot to determine how well a gene is expressed amongst any two datasets.
- Gene expression across datasets for multiple genes
- Exploring gene expression relationship with metadata variables in bulk datasets
For each gene, BITHub displays interactive plots that allow the full exploration of gene expression values (CPM/TPM/RPKM - depending on the original dataset normalization) in the bulk and single-nucleus datasets. By selecting metadata variables, users have the ability to determine how gene expression of interest varies with any metadata properties such as phenotype (e.g Age, Sex ), sample characterics or sequencing metrics. Users also have the ability to filter the data based on region by selecting their region of interest from the ‘Select Brain Region’ drop down menu.
- Exploring impact of cellular composition on gene expression
For bulk datasets, BITHub provides information of cell-type deconvolution from the original study. Users can explore these proeprties by selecting the cell-types from the metadata panel under Sample Characertics.
Currently BITHub only provides these composition estimates for BrainSeq and PsychEncode data. However, we are working on a pipeline to standardize deconvolution estimates for these datasets.
- Drivers of variation
BITHub incorporates results from varianceParition. The bar-graph for the variance partition shows the fraction of variance explained against selected metadata variables. The varianceParition results are currently only available for the bulk datasets.
- Exploring single-cell properties
- Removing a specific annotation for overview of properties
For each panel, the data displayed can be downloaded as a .csv
file, and the corresponging plot as an image file (either .svg
or .png
).
This is NOT necessary for end users but is included for future work. During the first run, some database files may be downloaded from the source - this amounts to a few hundred megabytes. Liftover is also performed during this first run, subsequent runs should be faster.
BITHub Requires Python3 and all the python libraries listed in requirements.txt.
After executing main.py, the pipeline/output folder will contain relevant hdf5 and bb files.
By copying these into the resources/data folder, you may update the website's core datasets and add new databases.
The website may be run locally but requires e.g. "Web Server for Chrome" due to webworkers.
To search entries and plot metadata, go to our GitHub Pages site.
Datasets may be added easily via the website interface but are not persistent, which would require editing the YAML and running the Python pipeline.