This repository provides data and code to reproduce data presented in "Spatial Transcriptomics to define transcriptional patterns of zonation and structural components in the liver", and show-cases the applicability and potential to study tissue transcriptomics computationally. In this study, Spatial Transcriptomics was performed on liver tissue sections of female wildtype mice of 8-12 weeks of age. All scripts presented here are designed to be used with any kind of spatial transcriptomics data and provide detailed documentation for the reproduction of the data presented in this original study. Original data and large files are placed at an external site save resources inlcuding count matrices, spot files, HE-images, masks, h5ad-files, etc. and can be accessed at zenodo.
Below is an overview of the structure of this repository, including brief descriptions of respective item.
-
data
- contains processed data to be used in the analysis (files must be downloaded from an external resource, see below)gene lists
marker genes
- contains csv files with central and portal markers of the ST data and the SC data analyis performed for the liver data of the Mouse Cell Atlasstereoscope
- contains lists of 5000 most variable genes of single cell Mouse Cell Atlas data and Halpern et al study for sterescope analysisveins
- contains shortlists of marker genes for portal and central veins for expression by distance analysis
sterescope/sc
- single cell data used for thestereoscope
analysis
-
scripts
- contains processing scripts and notebooksLiver-ST.Rmd
- contains a R markdown script to perform canonical correlation analyis, clustering and DGEA, tissue visualization, correlation analysis, visualization of single cell integration using single cell data of the Mouse Cell Atlas and comparative analyses with published data from Halpern et alMultiCCA.R
- contains code for the modified canonical correlation analysis function used inLiver-ST.Rmd
cluster-interaction-analysis.ipynb
- notebook outlining the cluster interaction analysis (used to produce Supplementary Image 2)synthetic-data-analysis.ipynb
- generation and analysis of synthetic spatial transcriptomics data, to compare correlation values between observed and random cell distributions.make-gene-list.R
- script to generate list of highly variable genes (hvgs) to use instereoscope
mapping.prepare-data.py
- program with CLI to generateh5ad
files for spatial analysis (invein-analysis.ipynb
). Seevein-analysis.ipynb
- notebook outlining the feature by distance analysis and vein type classification/prediction based on NEPs.proportion-analysis.ipynb
- similar to first part ofvein-analysis.ipynb
but looking at the cell type proportions -compared to expression levels - (obtained fromstereoscope
) as a function of distance to the nearest vein.
-
res/sterescope-res/
CNX_ZY
- folder for each sample (X,Y, and Z are various parts of identifiers). Each folder contains aW*.tsv
file which are thesterescope
results, formatted as[n_spots]x[n_types]
matrices.st_loss.txt
- loss output for st-modelsc_loss.txt
- loss output for sc-modelst_model.pt.gz
- gzipped fitted st-modelsc_model.pt.gz
- gzipped fitted sc-modelR.tsv.gz
- gzipped estimated R paramters (instereoscope
model)logits.tsv.gz
- gzipped esitmated logits parameters (instereoscope
model)
-
hepaquery
- files constituting thehepaquery
package -
setup.py
- installation file forhepaquery
module
The code should be compatible with all systems that support R
and python
. We recommend the following software versions:
R >= 3.5
python >= 3.7
To reproduce the analysis presented in the jupyter-notebooks
and R markdown
files, the packages listed below must be installed
To facilitate reproduction of our results and enable easy exploration of similar
data sets using the methods we present in this work, we have packaged these into
a Python module called hepaquery
. This, among other things, contains functions
to generate feature_by_distance plots, classification of vein types based on
neighborhood expression profiles (NEPs), and evaluation of prediction results.
The hepaquery
package mainly relies on the python scientific framework, and
will be installed automatically when running the commands in the next section.
However, to list the packages and their recommended versions:
scikit-misc
numpy>=1.19.0
pandas>=1.0.0
anndata>=0.7.5
scipy>=1.5.4
scikit-learn>=0.23.2
matplotlib>=3.3.3
To install this package, do:
- Enter a terminal window, change directory to this repository and type
$> python3 setup.py install
Depending on your OS and user configurations, you might have to add --user
for
this to work. The installation takes only a few seconds on a standard laptop
computer.
- Next, to test if the installation work, enter the following into the terminal:
$> python3 -c "import hepaquery; print(hepaquery.__version__)"
If everything went as expected, this should print the version of hepaquery
in your terminal.
The notebook scripts/vein-analysis.ipynb
illustrates the usage of hepaquery
.
Here the examples of the expected output when running hepaquery
is found, and
instructions regarding how to reproduce the plots presented in the manuscript.
The analysis on the data in this study takes less than 5 minutes when run on a
laptop computer.
To conduct the feature by distance analysis, the data must first be curated and
formatted. For this purpose we provide the scripts/prepare-data.py
script,
which offers a convenient CLI to easily produce the required h5ad
files from
the raw count matrices, spot-files and HE-images. One additional key element
are the masks indicating veins in the tissue. These masks should align with
the HE-images (same dimensions), and have each veins colored by their class.
YAML files are then used to specify which elements to include in the assembly of
the h5ad
object.
For each item associated with a unique sample, create a YAML file
(filename.yaml
) with the following structure:
count_data: PATH_TO_COUNT_FILE
spot_data: PATH_TO_SPOT_FILE
image: PATH_TO_HE-IMAGE_FILE
mask: PATH_TO_MASK_FILE
rgb:
class1: [R1,G1,B1]
class2: [R2,G2,B2]
class3: [R3,G3,B3]
where class1
indicate the name of the first class (e.g., central) and [R1,G1,B1]
gives the RGB value that indicate class1 in the mask. This is what we refer to as a configuration file
Next, simply go to the scripts
folder and run in a terminal run:
$> python3 ./prepare-data.py -i CONFIG_FILE.yaml -n N_TYPES -s -d -o OUT_DIR
And h5ad
files containing the essential information used in the
vein-analysis.ipynb
files will be produced. For more information regarding the
parameters that may be used, simply do python3 ./prepare-data.py -h
.
While GitHub supports storage of large files via the LFS system, we have placed
our files at an external site to prevent unnecessary use of resources. The count
matrices, spot files, HE-images and masks can be accessed at this link.
To download all data and place it in the expected (by the scripts) location, you
can also go to scripts
, open a terminal and do:
$> chmod +x ./fetch-data.sh
$> ./fetch-data.sh ZENDO_LINK
In case the above does not work for you, simply go to the link referenced above,
download the files and place the content of Hepaquery_data.zip
in the data
folder. This should be equivalent to what the script is doing for you.
This work is covered under the MIT License.