matrix-api: A repository from johnkerl

Unified Single-cell Data Model and API

Opportunity:

The programming language and toolchain used to analyse single-cell data determine the format in which the data will be published. These language- and toolchain-driven data silos inhibit use cases, such as model training, that require bringing data to tools.
Best-in-class algorithms are often available in only a single language ecosystem or toolchain, or take substantial effort to make portable.
Multimodal data, which are becoming more common, lack standardization and support, particularly in the Python ecosystem.
Data are becoming large enough that moving serialized objects around will soon be infeasible. Cloud-optimized formats will be required to support the next analysis phase, and out-of-core processing is becoming increasingly important.

We envision an API that enables users to slice and compute on large (hundreds of millions of observations) single-cell datasets stored in the cloud using the AnnData, SingleCellExperiment, and Seurat toolchains.

To get started, we will focus on enabling data to be used by all major toolchains and on getting the multimodal data use cases right. We believe this will be sufficient to drive API adoption. With data silos broken, larger data use cases will be enabled and the need for a cloud-optimized data format will be more widely felt.

Initial Focus:

A standardized single-cell container, with (basic) read & query access to the data in the container.
Import/export from all commonly used in-memory formats (eg, AnnData, SingleCellExperiment, or Seurat)
Access to underlying native (eg, TileDB) objects to allow advanced use cases
Python and R support

Longer term goals:

Native support in the popular toolchains, reducing the reliance on on-the-fly conversion (eg, anndata2ri)
Incremental modification and composition of the dataset
Optimizations required for performant analysis

The core functions of the initial API are:

Compose an sc_dataset Python object out of pre-existing TileDB arrays. This object is the composition of one or more sc_groups (terms defined below).
Simple access to sc_dataset/sc_group properties (eg, obs, var, X) and ability to slice/query on the entire object based on obs/var labels.
For Python, import/export an in-memory AnnData object (for subsequent use with AnnData/Scanpy) from either a slice/query result or the entire object. For R, the same basic function, but to/from Seurat and SingleCellExperiment.
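
The three core functions above can be sketched roughly as follows. This is an in-memory illustration only, not the proposed API: the class names `SCGroup` and `SCDataset`, the `query` method and its `obs_labels`/`var_labels` parameters, and the `to_dict` export are all hypothetical. Plain Python lists stand in for TileDB arrays, and a plain dict stands in for the AnnData export.

```python
from dataclasses import dataclass, field


@dataclass
class SCGroup:
    """Hypothetical stand-in for an sc_group: one annotated matrix."""
    obs: dict  # obs label -> annotations for that observation
    var: dict  # var label -> annotations for that variable
    X: list    # dense rows, one per obs label, in obs insertion order

    def query(self, obs_labels=None, var_labels=None):
        """Return a new SCGroup restricted to the given obs/var labels."""
        obs_ids, var_ids = list(self.obs), list(self.var)
        keep_obs = obs_ids if obs_labels is None else list(obs_labels)
        keep_var = var_ids if var_labels is None else list(var_labels)
        rows = [obs_ids.index(o) for o in keep_obs]
        cols = [var_ids.index(v) for v in keep_var]
        return SCGroup(
            obs={o: self.obs[o] for o in keep_obs},
            var={v: self.var[v] for v in keep_var},
            X=[[self.X[r][c] for c in cols] for r in rows],
        )

    def to_dict(self):
        """Export to a plain dict; a real API would build an AnnData here."""
        return {"obs": self.obs, "var": self.var, "X": self.X}


@dataclass
class SCDataset:
    """Hypothetical stand-in for an sc_dataset: a set of named sc_groups."""
    groups: dict = field(default_factory=dict)

    def query(self, obs_labels=None):
        """Apply the same obs selection to every group in the dataset."""
        return SCDataset(
            {name: g.query(obs_labels=obs_labels) for name, g in self.groups.items()}
        )


# 1. Compose a dataset from an existing group.
rna = SCGroup(
    obs={
        "cell_1": {"cell_type": "B"},
        "cell_2": {"cell_type": "T"},
        "cell_3": {"cell_type": "B"},
    },
    var={"gene_a": {}, "gene_b": {}},
    X=[[1, 2], [3, 4], [5, 6]],
)
dataset = SCDataset(groups={"rna": rna})

# 2. Slice the whole dataset by obs labels, then one group by var labels.
b_cells = [label for label, ann in rna.obs.items() if ann["cell_type"] == "B"]
sliced = dataset.query(obs_labels=b_cells)

# 3. Export the slice to an in-memory object for downstream tooling.
exported = sliced.groups["rna"].query(var_labels=["gene_b"]).to_dict()
```

In this sketch the dataset-level query selects on obs only, on the assumption that groups in a multimodal dataset share observations but not variables; var selection is left to the individual group.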

This initial draft proposes an API for single-cell data that attempts to unify the data models followed by AnnData, Bioconductor’s SingleCellExperiment, Seurat and CXG. The initial API surface is intentionally focused on a small initial set of use cases, on the assumption that API users can always escape to the more complete toolchain-specific APIs, or to the underlying (advanced) native objects (eg, TileDB). We are seeking community feedback.

It first describes a general data model that captures all the above frameworks. Then it explains how TileDB can implement this model on-disk so that we have a concrete implementation reference. Next, it describes how TileDB can query this model with its generic array API “soon” (as some minor features are a work in progress). Subsequently, it proposes a more single-cell-specific API that can easily be built on top of TileDB’s, in order to hide the TileDB-specific implementation and API. We will focus only on Python and R for now for simplicity. Finally, it concludes with a list of features that will need to be implemented on the TileDB side in order to have a working prototype very soon.

Development Roadmap:

Q1 Goal: Demonstrate a proof-of-concept of the Matrix-API and a TileDB-based format implementation that can generate and be created from AnnData, MuData, SingleCellExperiment, MultiAssayExperiment, and Seurat objects.

Deliverables:

Definition of the matrix API spec for Python, R and C++, fully documented
A first pass implementation of the API using TileDB
Single-modality data support:
- Ability for round-trip AnnData -> matrix API -> write to TileDB format -> read to matrix API -> AnnData
- Ability for round-trip Seurat -> matrix API -> write to TileDB format -> read to matrix API -> Seurat
- Ability for round-trip SingleCellExperiment -> matrix API -> write to TileDB format -> read to matrix API -> SingleCellExperiment
Multi-modality data support:
- Ability for round-trip MultiAssayExperiment -> matrix API -> write to TileDB format -> read to matrix API -> MultiAssayExperiment
- Ability for round-trip MuData -> matrix API -> write to TileDB format -> read to matrix API -> MuData
- Ability for round-trip Seurat -> matrix API -> write to TileDB format -> read to matrix API -> Seurat
Ability to store analysis results (e.g., graphs and reductions)
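
Each round-trip deliverable above amounts to a check that writing an object to disk and reading it back yields an equal object. A minimal sketch of such a check follows. It makes several assumptions: `write_group`, `read_group`, and `round_trips` are hypothetical names, JSON files in a temporary directory stand in for the TileDB on-disk format, and a plain dict stands in for the in-memory matrix API object.

```python
import json
import tempfile
from pathlib import Path


def write_group(group: dict, uri: str) -> None:
    """Hypothetical writer: one JSON file per component (obs, var, X, ...)."""
    root = Path(uri)
    root.mkdir(parents=True, exist_ok=True)
    for name, payload in group.items():
        (root / f"{name}.json").write_text(json.dumps(payload))


def read_group(uri: str) -> dict:
    """Hypothetical reader: rebuild the dict from the files under uri."""
    root = Path(uri)
    return {p.stem: json.loads(p.read_text()) for p in root.glob("*.json")}


def round_trips(group: dict) -> bool:
    """Write the group to a temporary location, read it back, and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        uri = str(Path(tmp) / "sc_group")
        write_group(group, uri)
        return read_group(uri) == group


# Example object, including an analysis result (a reduction) stored in obsm.
example = {
    "obs": {"cell_1": {"cell_type": "B"}, "cell_2": {"cell_type": "T"}},
    "var": {"gene_a": {}, "gene_b": {}},
    "X": [[1.0, 0.0], [0.0, 2.5]],
    "obsm": {"X_pca": [[0.1, 0.2], [0.3, 0.4]]},
}
ok = round_trips(example)
```

A real test for each toolchain would replace the dict with an AnnData, Seurat, or SingleCellExperiment object and use that toolchain's own notion of equality.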

Plan:

Category	Milestone	Date
Foundation	Ability to import an h5ad file to the TileDB on-disk format that the matrix API will use	Feb 4
Foundation	Implement C++ API of the matrix API spec with read support from the TileDB on-disk format	Feb 4
Python	Define the in-memory format spec for the Python API	Feb 4
Python	Build the Python API wrapper for the C++ API that implements the matrix API, with focus on reads	Feb 4
Python	Implement to_anndata from the Python in-memory objects of the matrix API	Feb 18
Python	Implement from_anndata to the in-memory format of the matrix API spec	Feb 18
R	Define the in-memory format spec for the R API (done, need to document)	Mar 4
R	Build the R API that wraps the C++ API of the matrix API spec with focus on reads	Feb 11
R	Implement to_seurat from the R in-memory objects of the matrix API	Mar 4
R	Implement from_seurat to the in-memory format of the matrix API spec (done, need to document)	Feb 18
R	Implement from_single_cell_experiment for Bioconductor	Mar 4
R	Implement to_single_cell_experiment for Bioconductor	Mar 4
Common	Write from the in-memory TileDB formats to TileDB on disk	Mar 4
Common	Storage of analysis results, such as graphs, reductions, etc.	Mar 11
Common	Multimodal support with sc_dataset	Mar 18
