
portal-visualization

Given HuBMAP Dataset JSON, creates a Vitessce configuration.

Release process

This is a dependency of portal-ui and search-api.

Updates that are more than housekeeping should result in a new release:

  • bump VERSION.txt.
  • make a new git tag: V=$(cat VERSION.txt); git tag $V; git push origin $V.
  • make a release on GitHub.
  • in portal-ui, update requirements.in and rebuild requirements.txt.
  • in search-api, just update requirements.txt.

CLI

Installing this package locally makes vis-preview.py available:

$ cd portal-visualization
$ pip install .
...
$ src/vis-preview.py --help
usage: vis-preview.py [-h] (--url URL | --json JSON) [--assaytypes_url URL]
                      [--assets_url URL] [--token TOKEN] [--marker MARKER]
                      [--to_json]

Given HuBMAP Dataset JSON, generate a Vitessce viewconf, and load vitessce.io.

optional arguments:
  -h, --help            show this help message and exit
  --url URL             URL which returns Dataset JSON
  --json JSON           File containing Dataset JSON
  --assaytypes_url URL  AssayType service; default:
                        https://ingest.api.hubmapconsortium.org/assaytype/
  --assets_url URL      Assets endpoint; default:
                        https://assets.hubmapconsortium.org
  --token TOKEN         Globus groups token; Only needed if data is not public
  --marker MARKER       Marker to highlight in visualization; Only used in
                        some visualizations.
  --to_json             Output viewconf, rather than open in browser.

Background

(Flow chart: raw data → ingest-pipeline Airflow DAGs + portal-containers pipelines → portal backend (this repo) → Vitessce view configuration.)

Data for the Vitessce visualization almost always comes from raw data that is processed by ingest-pipeline Airflow DAGs. Harvard often contributes its own custom pipelines to these DAGs, which can be found in portal-containers. When a Dataset that should be visualized is requested in the client, the portal backend uses the code in this repo to convert the outputs of these pipelines into view configurations for Vitessce. The view configurations are built using the Vitessce-Python API.
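
As a rough illustration of that last step, here is a minimal sketch using the Vitessce-Python API. The URL is a placeholder, and the wrapper parameter names follow older vitessce-python releases (newer releases rename them and also expect a schema_version argument), so treat this as illustrative rather than as the code in this repo.

from vitessce import VitessceConfig, Component as cm, AnnDataWrapper

# Illustrative only: one AnnData-Zarr dataset, a UMAP scatterplot, and a gene list.
# Newer vitessce-python releases also expect a schema_version argument here.
vc = VitessceConfig(name="Example HuBMAP Dataset")
dataset = vc.add_dataset(name="Example").add_object(
    AnnDataWrapper(
        adata_url="https://assets.example.org/uuid/anndata.zarr",  # placeholder URL
        mappings_obsm=["X_umap"],          # older-API parameter names
        mappings_obsm_names=["UMAP"],
        expression_matrix="X",
    )
)
vc.add_view(cm.SCATTERPLOT, dataset=dataset, mapping="UMAP")
vc.add_view(cm.GENES, dataset=dataset)
conf = vc.to_dict()  # a plain dict, ready to serialize for the client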

Imaging Data

HuBMAP receives various imaging modalities (microscopy and otherwise). The processing is fairly uniform, and always includes running ome-tiff-pyramid + a pipeline for extracting byte offsets, which speeds up visualization loading for large imaging datasets. Vitessce is able to view OME-TIFF files directly via Viv. Two pipelines are commonly used for more analytically oriented processing of the image data: Cytokit produces segmentations (+ stitching if the input data is tiled) for downstream analysis, and SPRM is an analytic pipeline that does clustering and quantification. Below are common questions and answers for imaging modalities:
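
As a rough sketch of what such an image visualization amounts to, vitessce-python's OmeTiffWrapper can point Vitessce (Viv) at a pyramidal OME-TIFF plus its byte-offsets file. The URLs below are placeholders, and the builders in this repo assemble these pieces with more care, so treat this as illustrative only.

from vitessce import VitessceConfig, Component as cm, OmeTiffWrapper

# Illustrative only: an image pyramid plus its offsets JSON, with a spatial
# view and a layer controller.
vc = VitessceConfig(name="Example imaging dataset")
dataset = vc.add_dataset(name="Image").add_object(
    OmeTiffWrapper(
        img_url="https://assets.example.org/uuid/image.ome.tiff",          # placeholder
        offsets_url="https://assets.example.org/uuid/image.offsets.json",  # placeholder
        name="Example image",
    )
)
vc.add_view(cm.SPATIAL, dataset=dataset)
vc.add_view(cm.LAYER_CONTROLLER, dataset=dataset)
conf = vc.to_dict()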

Has the data been validated via ingest-validation-tools and confirmed to be viewable using Avivator (which loads data almost identically to what is in the portal)?

If so, we should ask the TMC to follow the instructions below for viewing their data in Avivator to make sure it looks right (should only need to be done for a single representative file): https://github.com/hms-dbmi/viv/tree/master/tutorial

In the above instructions they should only need to a) run the bioformats2raw-raw2ometiff pipeline and then b) drag-and-drop or select the input file using the "CHOOSE A FILE" button on avivator.gehlenborglab.org. There is no need for a web server.

If there is a z or t stack to the data, ensure that each "stack" is uploaded as a single file.

If it is valid in these three senses (viewable in Avivator locally, passes ingest-validation-tools, and "stacks" are uploaded as single files), then ingestion may be done and pipeline processing may proceed.

Is there "spot" data, such as resolved probe locations from a FISH assay that needs to be visualized as a Vitessce molecules data type?

If the answer is "yes," we should run the image pyramid pipeline + offsets on the appropriate imaging data. We currently do not have a pipeline for visualizing spot data. Create a new class that inherits from ViewConfBuilder to visualize the data (raw imaging + spot data) when such a pipeline is created. If there is segmentation data coming from the TMC or elsewhere, then that will need to be both processed (via sprm-to-anndata.cwl from portal-containers or a different pipeline that ideally outputs zarr-backed AnnData) and visualized as well.

Will Cytokit + SPRM be run?

If the answer is "yes," we should run sprm-to-anndata.cwl from portal-containers on the output of SPRM and the image pyramid pipeline + offsets on the output of Cytokit. Extend StitchedCytokitSPRMViewConfBuilder to handle this assay.

Will only SPRM be run (on non-Cytokit Segmentations)?

If the answer is "yes," we should run sprm-to-anndata.cwl from portal-containers from portal-containers on the output of SPRM and the image pyramid pipeline + offsets on the raw input data. Create a new class that extends MultiImageSPRMAnndataViewConfBuilder, similar to StitchedCytokitSPRMViewConfBuilder if needed for multiple images in the same dataset. Otherwise you may use SPRMAnnDataViewConfBuilder with the proper arguments.

For everything else...

Run the image pyramid pipeline + offsets on the raw input data. Attach the assay to a new class in the portal backend similar to SeqFISHViewConfBuilder or ImagePyramidViewConfBuilder. This will depend on how you want the layout to look to the end user.
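
As a very rough sketch, a new builder usually subclasses one of the existing builders and adjusts how the configuration is assembled. The module path and the overridden method name below are assumptions about this repo's builder interface; check the actual base classes before copying this.

# Hypothetical sketch only: the module path and method name are assumptions.
from portal_visualization.builders.imaging_builders import ImagePyramidViewConfBuilder

class MyNewImagingViewConfBuilder(ImagePyramidViewConfBuilder):
    def get_conf_cells(self, **kwargs):
        # Start from the standard image-pyramid configuration...
        conf_cells = super().get_conf_cells(**kwargs)
        # ...and adjust the views/layout here for the new assay.
        return conf_cells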

Sequencing Data

xxxx-RNA-seq

Currently, RNA-seq data comes as AnnData h5ad files from Matt's pipeline. Vitessce is able to view AnnData directly when it is saved as zarr. In order to visualize the data, the incoming AnnData h5ad file must be altered as follows (see the sketch after this list):

  1. Chunk the data correctly for optimal viewing.
  2. Place marker genes in the obs part of the store (so they may be visualized as pop-overs when hovered).
  3. Store a filter for a subset of genes (corresponding to the marker genes) so that it may be rendered as a heatmap.
  4. Save the altered dataset as a .zarr store.
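
A minimal sketch of those steps with the anndata API might look like the following. The file names, marker-gene list, column names, and chunk size are illustrative assumptions; the actual logic lives in the anndata-to-ui container mentioned below.

import anndata

VAR_CHUNK_SIZE = 10  # illustrative chunk width along the gene axis

adata = anndata.read_h5ad("secondary_analysis.h5ad")  # placeholder file name
marker_genes = ["Gene_A", "Gene_B"]  # hypothetical marker list

# Step 2: copy marker-gene expression into obs so it can be shown on hover.
for gene in marker_genes:
    adata.obs[f"marker_gene_{gene}"] = adata[:, [gene]].to_df()[gene]

# Step 3: store a boolean var filter selecting the genes for the heatmap.
adata.var["marker_genes_for_heatmap"] = adata.var_names.isin(marker_genes)

# Steps 1 and 4: write a zarr store, chunked so whole gene columns load at once.
adata.write_zarr("secondary_analysis.zarr", chunks=[adata.shape[0], VAR_CHUNK_SIZE])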

These steps are executed by the anndata-to-ui container that is run after Matt's pipeline; the view config is generated by RNASeqAnnDataZarrViewConfBuilder. Currently the portal backend cannot handle slide-seq, which is a spatially resolved RNA-seq assay, but its ViewConfBuilder class will look much the same as RNASeqAnnDataZarrViewConfBuilder, except for an additional spatial_polygon_obsm="X_spatial" argument to the AnnDataWrapper as well as a SPATIAL Vitessce component in the view config.
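
To make that difference concrete, the slide-seq wrapper call might differ only in the extra spatial pieces, roughly as sketched below. The URL is a placeholder, and the parameter names follow the older vitessce-python AnnDataWrapper API, so newer releases will differ.

from vitessce import VitessceConfig, Component as cm, AnnDataWrapper

# Illustrative only: the RNA-seq wrapper plus spatial coordinates and a SPATIAL view.
vc = VitessceConfig(name="Example spatial RNA-seq dataset")
dataset = vc.add_dataset(name="Example").add_object(
    AnnDataWrapper(
        adata_url="https://assets.example.org/uuid/anndata.zarr",  # placeholder URL
        mappings_obsm=["X_umap"],
        mappings_obsm_names=["UMAP"],
        spatial_polygon_obsm="X_spatial",  # the additional argument noted above
    )
)
vc.add_view(cm.SCATTERPLOT, dataset=dataset, mapping="UMAP")
vc.add_view(cm.SPATIAL, dataset=dataset)  # the additional SPATIAL component
conf = vc.to_dict()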

xxxx-ATAC-seq

Currently, only the (mis-named) h5ad-to-arrow pipeline is used to convert h5ad AnnData files to JSON that contains only the scatterplot results of the Scanpy analysis. In the future, vitessce-python (or something similar) should be used as a new container to process the SnapATAC-backed (or otherwise stored) peaks for visualization in Vitessce as genomic profiles. See here for a demo of what the final result will look like.

SNARE-seq

SNARE-seq is a mix of the above two modalities, and its processing and visualization are still TBD.