Super-resolved spatial transcriptomics by deep data fusion

XFuse: Deep spatial data fusion

This repository contains code for the paper “Super-resolved spatial transcriptomics by deep data fusion”.

Nature Biotechnology: https://doi.org/10.1038/s41587-021-01075-3

BioRxiv preprint: https://doi.org/10.1101/2020.02.28.963413

Hardware requirements

XFuse can run on CPU-only hardware, but training new models will take an exceedingly long time. We recommend running XFuse on a GPU with at least 8 GB of VRAM.

Software requirements

XFuse has been tested on GNU/Linux but should run on all major operating systems. XFuse requires Python 3.8. All other dependencies are pulled in by pip during the installation.

Installing

To install XFuse to your home directory, run

pip install --user git+https://github.com/ludvb/xfuse@master

This step should only take a few minutes.

Getting started

This section will guide you through how to start an analysis with XFuse using data on human breast cancer from [fn:1].

[fn:1]: https://doi.org/10.1126/science.aaf2403

Data

The data is hosted on www.spatialresearch.org. To download all of the required files for the analysis, run

# Image data
curl -Lo section1.jpg https://www.spatialresearch.org/wp-content/uploads/2016/07/HE_layer1_BC.jpg
curl -Lo section2.jpg https://www.spatialresearch.org/wp-content/uploads/2016/07/HE_layer2_BC.jpg
curl -Lo section3.jpg https://www.spatialresearch.org/wp-content/uploads/2016/07/HE_layer3_BC.jpg
curl -Lo section4.jpg https://www.spatialresearch.org/wp-content/uploads/2016/07/HE_layer4_BC.jpg

# Gene expression count data
curl -Lo section1.tsv https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer1_BC_count_matrix-1.tsv
curl -Lo section2.tsv https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer2_BC_count_matrix-1.tsv
curl -Lo section3.tsv https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer3_BC_count_matrix-1.tsv
curl -Lo section4.tsv https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer4_BC_count_matrix-1.tsv

# Alignment data
curl -Lo section1-alignment.txt https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer1_BC_transformation.txt
curl -Lo section2-alignment.txt https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer2_BC_transformation.txt
curl -Lo section3-alignment.txt https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer3_BC_transformation.txt
curl -Lo section4-alignment.txt https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer4_BC_transformation.txt
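
If curl is not available, the same downloads can be scripted in Python. The following is a minimal sketch using only the standard library; the helper names download_plan and download_all are made up for this example and are not part of XFuse. The URLs are the ones from the curl commands above.

```python
from urllib.request import urlretrieve

# Base URL taken from the curl commands above.
BASE_URL = "https://www.spatialresearch.org/wp-content/uploads/2016/07"


def download_plan(sections=(1, 2, 3, 4)):
    """Return (local filename, remote URL) pairs for the given sections."""
    plan = []
    for i in sections:
        # Image data
        plan.append((f"section{i}.jpg", f"{BASE_URL}/HE_layer{i}_BC.jpg"))
        # Gene expression count data
        plan.append(
            (f"section{i}.tsv", f"{BASE_URL}/Layer{i}_BC_count_matrix-1.tsv")
        )
        # Alignment data
        plan.append(
            (
                f"section{i}-alignment.txt",
                f"{BASE_URL}/Layer{i}_BC_transformation.txt",
            )
        )
    return plan


def download_all(sections=(1, 2, 3, 4)):
    """Download every file in the plan to the current directory."""
    for filename, url in download_plan(sections):
        print(f"Downloading {url} -> {filename}")
        urlretrieve(url, filename)
```

Calling download_all() fetches all twelve files into the current directory. Some servers reject the default Python user agent, in which case the curl commands above remain the safer option.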

Preprocessing

XFuse uses a specialized data format to optimize loading speeds and allow for lazy data loading. XFuse has inbuilt support for converting data from 10X Space Ranger (xfuse convert visium) and the Spatial Transcriptomics Pipeline (xfuse convert st) to its own data format. If your data has been produced by another pipeline, it may need to be wrangled into a supported format before continuing. Feel free to open an issue on our issue tracker if you run into any problems or to request support for a new platform.

The data from the Data section was produced by the Spatial Transcriptomics Pipeline, so we can run the following commands to convert it to the right format:

xfuse convert st --counts section1.tsv --image section1.jpg --transformation-matrix section1-alignment.txt --scale 0.15 --save-path section1
xfuse convert st --counts section2.tsv --image section2.jpg --transformation-matrix section2-alignment.txt --scale 0.15 --save-path section2
xfuse convert st --counts section3.tsv --image section3.jpg --transformation-matrix section3-alignment.txt --scale 0.15 --save-path section3
xfuse convert st --counts section4.tsv --image section4.jpg --transformation-matrix section4-alignment.txt --scale 0.15 --save-path section4
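
Since the four commands differ only in the section number, they can also be generated in a loop. The sketch below builds the same command lines as above; convert_command and convert_all are hypothetical helper names, and the only XFuse flags used are the ones already shown. By default it only prints the commands (dry run).

```python
import shlex
import subprocess


def convert_command(section, scale=0.15):
    """Build the `xfuse convert st` command for one section as an argument list."""
    name = f"section{section}"
    return [
        "xfuse", "convert", "st",
        "--counts", f"{name}.tsv",
        "--image", f"{name}.jpg",
        "--transformation-matrix", f"{name}-alignment.txt",
        "--scale", str(scale),
        "--save-path", name,
    ]


def convert_all(sections=(1, 2, 3, 4), scale=0.15, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for section in sections:
        cmd = convert_command(section, scale)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing conversion.
            subprocess.run(cmd, check=True)
```

Run convert_all(dry_run=False) to execute the conversions, or pass a different scale to regenerate all four datasets consistently.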

It may be worthwhile to try out different values for the --scale argument, which sets the factor by which the image data is rescaled (0.15 in the commands above). A higher scale increases the resolution of the model but requires considerably more compute power.
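
To get a feel for the cost of a larger scale, the following back-of-the-envelope sketch may help. It assumes that the scale factor is applied to both image dimensions, so that the number of pixels grows with the square of the scale; the image size in the example is made up and not taken from the dataset.

```python
def scaled_size(width, height, scale):
    """Image size in pixels after rescaling both dimensions by `scale`."""
    return round(width * scale), round(height * scale)


def relative_pixel_count(scale, reference=0.15):
    """Pixel count at `scale` relative to the pixel count at `reference`."""
    return (scale / reference) ** 2


# Hypothetical 10000 x 8000 pixel slide image.
print(scaled_size(10000, 8000, 0.15))
print(relative_pixel_count(0.30))
```

Under this assumption, doubling the scale roughly quadruples the amount of image data the model has to process.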

Verifying tissue masks

It is usually a good idea to verify that the computed tissue masks look good. This can be done using the script ./scripts/visualize_tissue_masks.py included in this repository:

curl -LO https://raw.githubusercontent.com/ludvb/xfuse/master/scripts/visualize_tissue_masks.py
python visualize_tissue_masks.py */data.h5

The script will show the tissue images with the detected backgrounds blacked out. If tissue detection fails, a custom mask can be passed to xfuse convert using the --mask-file argument (see xfuse convert visium --help for more information).

Configuring and starting the run

Settings for the run are specified in a configuration file. Paste the following into a file named my-config.toml:

[xfuse]
network_depth = 6
network_width = 16
min_counts = 50

[expansion_strategy]
type = "DropAndSplit"
[expansion_strategy.DropAndSplit]
max_metagenes = 50

[optimization]
batch_size = 3
epochs = 100000
learning_rate = 0.0003
patch_size = 768

[analyses]
[analyses.metagenes]
type = "metagenes"
[analyses.metagenes.options]
method = "pca"

[analyses.gene_maps]
type = "gene_maps"
[analyses.gene_maps.options]
gene_regex = ".*"

[slides]
[slides.section1]
data = "section1/data.h5"
[slides.section1.covariates]
section = 1

[slides.section2]
data = "section2/data.h5"
[slides.section2.covariates]
section = 2

[slides.section3]
data = "section3/data.h5"
[slides.section3.covariates]
section = 3

[slides.section4]
data = "section4/data.h5"
[slides.section4.covariates]
section = 4
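
For experiments with many slides, the repetitive [slides] section can be generated instead of typed by hand. This is a minimal sketch that emits the same TOML layout as above; slides_section and toml_value are hypothetical helpers, and only integer and string covariate values are handled.

```python
import json


def toml_value(value):
    """Format an int or str as a TOML value (other types are not handled)."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        raise TypeError(f"unsupported value: {value!r}")
    # JSON string quoting is used here as an approximation of TOML basic strings.
    return json.dumps(value) if isinstance(value, str) else str(value)


def slides_section(slides):
    """Render a [slides] section.

    `slides` maps each slide name to a (data path, covariates dict) pair.
    """
    lines = ["[slides]"]
    for name, (data, covariates) in slides.items():
        lines.append(f"[slides.{name}]")
        lines.append(f"data = {toml_value(data)}")
        lines.append(f"[slides.{name}.covariates]")
        lines.extend(
            f"{key} = {toml_value(value)}" for key, value in covariates.items()
        )
        lines.append("")  # blank line between slides
    return "\n".join(lines)


# Reproduces the [slides] section of the configuration above.
slides = {
    f"section{i}": (f"section{i}/data.h5", {"section": i})
    for i in range(1, 5)
}
print(slides_section(slides))
```

Additional covariates are added by extending the dictionary, for example {"section": 1, "patient": "P1"}, where the patient key and its value are purely illustrative.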

Here is a non-exhaustive summary of the available configuration options:

  • xfuse.network_depth: The number of up- and downsampling steps in the fusion network. If you are running on large images (using a large value for the --scale argument in xfuse convert), you may need to increase this number.
  • xfuse.network_width: The number of channels in the image and expression decoders. You may need to increase this value if you are studying tissues with many different cell types.
  • xfuse.min_counts: The minimum number of reads for a gene to be included in the analysis.
  • expansion_strategy.DropAndSplit.max_metagenes: The maximum number of metagenes to create during inference. You may need to increase this value if you are studying tissues with many different cell types.
  • optimization.batch_size: The mini-batch size. This number should be kept as high as possible to keep gradients stable but can be reduced if you are running XFuse on a GPU with limited memory capacity.
  • optimization.epochs: The number of epochs to run. When set to a value below zero, XFuse will use a heuristic stopping criterion.
  • optimization.patch_size: The size of training patches. This number should preferably be a multiple of 2^xfuse.network_depth to avoid misalignments during up- and downsampling steps.
  • slides: This section defines which slides to include in the experiment. Each slide is associated with a unique subsection. In each subsection, a data path and optional covariates to control for are specified. For example, in the configuration file above, we have given each slide a section condition with a distinct value to control for sample-wise batch effects. If our dataset contained samples from different patients, we could, for example, also include a patient condition to control for patient-wise effects.
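
The recommendation for optimization.patch_size can be checked with a few lines of arithmetic. The sketch below tests whether a patch size is a multiple of 2^network_depth and, if it is not, rounds it up to the next multiple; the function names are made up for this example.

```python
def is_aligned(patch_size, network_depth):
    """True if patch_size is a multiple of 2 ** network_depth."""
    return patch_size % (2 ** network_depth) == 0


def next_aligned(patch_size, network_depth):
    """Smallest multiple of 2 ** network_depth that is >= patch_size."""
    step = 2 ** network_depth
    # Ceiling division without floating point arithmetic.
    return -(-patch_size // step) * step


# Values from the configuration above: 768 = 12 * 2**6.
print(is_aligned(768, 6))
# A patch size of 800 would be rounded up to 13 * 64.
print(next_aligned(800, 6))
```

With the configuration above (network_depth = 6, patch_size = 768), the patch size is already aligned. If network_depth is increased for larger images, it is worth rerunning the check.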

We are now ready to start the analysis!

xfuse run my-config.toml --save-path my-run

Tip: XFuse can generate a template for the configuration file by running

xfuse init my-config.toml section1/data.h5 section2/data.h5 section3/data.h5 section4/data.h5

Tracking the training progress

XFuse continually writes training data to a TensorBoard log file. To check how the optimization is progressing, start a TensorBoard web server and point it to the --save-path of the run:

tensorboard --logdir my-run

Stopping and resuming a run

To stop the run before it has completed, press Ctrl+C. A snapshot of the model state will be saved to the --save-path. The snapshot can be restored by running

xfuse run my-config.toml --save-path my-run --session my-run/exception.session
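
When runs are launched from a script, the choice between starting fresh and resuming can be automated. The following sketch assumes that the snapshot is always named exception.session inside the --save-path, as in the command above; run_command is a hypothetical helper, not part of XFuse.

```python
from pathlib import Path


def run_command(config, save_path):
    """Build the `xfuse run` command, resuming from a snapshot if one exists."""
    cmd = ["xfuse", "run", config, "--save-path", save_path]
    # Assumed snapshot location, based on the resume command shown above.
    session = Path(save_path) / "exception.session"
    if session.exists():
        cmd += ["--session", str(session)]
    return cmd


print(" ".join(run_command("my-config.toml", "my-run")))
```

The returned list can be passed to subprocess.run to start or resume the run.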

Finishing the run

Training the model from scratch will take roughly three days on a normal desktop computer with an Nvidia GeForce 20 series graphics card. After training, XFuse runs the analyses specified in the configuration file. Results will be saved to a directory named analyses in the --save-path.