scverse/anndataR

Interface for writing to disk

Closed this issue · 8 comments

We talked a bit about how the classes should be structured and how the interface to existing files should work but not so much about how writing a new file should work. We want users to be able to write a SingleCellExperiment (and maybe a Seurat???) to disk but there are different ways that could work. It would be nice to decide soonish before we get too far into implementing things. Possible options (I'm sure there are more I haven't thought of):

Option 1: Reverse conversion

Functions/methods that match the conversion to R objects but in reverse. Something like:

interface <- to_AnnData(sce, interface = "InMemory"/"H5AD"/"Zarr"...)

Then the interface could have a writing method:

interface$write("path/to/file")

This somewhat matches the Python interface, but it probably exposes the intermediate objects to normal users too much, which I think we want to avoid.

Option 1B

You could also probably do something like this, where the method acts on the abstract class (although I'm not sure exactly how that would work):

interface <- to_AnnData(sce)
interface$write_h5ad("path/to/file")

Option 2: Direct writing

The other option is to have explicit writeH5AD()/writeZarr()... functions that act on SingleCellExperiment/Seurat objects (and maybe also the interfaces?). Something like:

writeH5AD(sce, "path/to/file")

This could work directly on the object or maybe by hiding Option 1:

writeH5AD <- function(sce, path) {
  # Convert to an H5AD-backed interface, then write it to the requested path
  interface <- to_AnnData(sce, interface = "H5AD")
  interface$write(path)
}

We would probably want matching readH5AD() etc. functions as well. I think this interface is nicer/easier for users but maybe a bit more work because it doesn't match as well with the classes (particularly the in-memory ones).

Option 3: Some combination of these/something else

Thoughts?

Option 3: I initially thought we would have functions called from_SingleCellExperiment and from_Seurat. Originally, I'd planned to have these functions simply return an InMemoryAnnData:

from_SingleCellExperiment <- function(sce) {
  # construct X, obs, var, ...
  InMemoryAnnData$new(...)
}

sce <- ...
adata <- from_SingleCellExperiment(sce)
h5ad <- adata$to_HDF5()
h5ad$write("dataset.h5ad")

However, it does make sense to have an argument to automatically perform this transformation for you:

from_SingleCellExperiment <- function(sce, output_class = c("InMemory", "HDF5", "Reticulate")) {
  output_class <- match.arg(output_class)

  # construct X, obs, var, ...

  # construct output object
  out <- InMemoryAnnData$new(...)
  if (output_class == "HDF5") {
    out <- out$to_HDF5()
  } else if (output_class == "Reticulate") {
    out <- out$to_Reticulate()
  }
  out
}

For large objects I think one would like to go directly from an SCE or other R object to disk, rather than through an intermediary.

I wonder if it isn't possible to write a single from_SummarizedExperiment(sce, <class generator>) in the same way that I'm aiming to have a single to_SummarizedExperiment(). So the user would say

from_SingleCellExperiment(sce, InMemoryAnnData, <class-specific options as '...'?>)  # or other *AnnData

and the implementation of from_SingleCellExperiment() would use the interface defined by AbstractAnnData(), including new(...), to construct the relevant object in a way that did not require detailed knowledge of each (or any) class.
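
As a rough illustration of this idea, here is a minimal base R sketch. Everything in it is hypothetical: `make_generator()`, `ToyInMemoryAnnData`, `ToyHDF5AnnData` and `from_sce_like()` are stand-ins invented for the example, the "generators" are plain lists with a `new()` function rather than real R6 classes, and a plain list plays the role of the SingleCellExperiment. The point is only that the conversion touches nothing but `generator$new()`.

```r
# Hypothetical stand-ins for the AnnData class generators: plain lists with a
# `new()` function, mimicking how an R6 generator is called.
make_generator <- function(backend) {
  list(
    classname = backend,
    new = function(X, obs, var, ...) {
      # Shared interface check: X must be observations x variables
      stopifnot(nrow(X) == nrow(obs), ncol(X) == nrow(var))
      list(backend = backend, X = X, obs = obs, var = var, options = list(...))
    }
  )
}
ToyInMemoryAnnData <- make_generator("InMemory")
ToyHDF5AnnData <- make_generator("HDF5")

# Generic conversion: relies only on generator$new(), not on any one backend.
# `sce_like` is a plain list standing in for a SingleCellExperiment.
from_sce_like <- function(sce_like, generator, ...) {
  # SCE-style counts are features x cells; AnnData X is observations x variables
  X <- t(sce_like$counts)
  generator$new(X = X, obs = sce_like$col_data, var = sce_like$row_data, ...)
}

sce_like <- list(
  counts = matrix(
    1:6, nrow = 2,
    dimnames = list(c("g1", "g2"), c("c1", "c2", "c3"))
  ),
  row_data = data.frame(symbol = c("A", "B"), row.names = c("g1", "g2")),
  col_data = data.frame(batch = c("x", "x", "y"), row.names = c("c1", "c2", "c3"))
)

mem <- from_sce_like(sce_like, ToyInMemoryAnnData)
# Class-specific options travel through `...`
h5 <- from_sce_like(sce_like, ToyHDF5AnnData, file = "dataset.h5ad")
```

In this sketch the class-specific options (here a `file` name) are simply passed through `...`, so the conversion function itself never needs to know which backend it is building.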

That's true, although I'd be happy if this package simply works to begin with ☺️

Would it be sufficient to add a constructor to each of the classes to create an empty AnnData with the right dimensions that we can then start copying data into? That would allow the following:

from_SingleCellExperiment <- function(sce, output_class = c("InMemory", "HDF5", "Reticulate")) {
  output_class <- match.arg(output_class)

  # figure out shape first
  shape <- ...

  # construct output object
  class <- 
    if (output_class == "InMemory") {
      InMemoryAnnData
    } else if (output_class == "HDF5") {
      HDF5AnnData
    } else if (output_class == "Reticulate") {
      ReticulateAnnData
    }
  out <- class$empty(shape = shape)

  # fill in slots one by one

  # return output
  out
}
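
To make the "create empty, then fill" idea concrete, here is a small base R sketch. The names `empty_toy_anndata()` and `fill_toy_anndata()` are hypothetical and not part of the package; an environment is used as a stand-in for an R6 object because it is mutable, and a plain matrix stands in for the SCE counts.

```r
# Hypothetical stand-in for `class$empty(shape = shape)`: the shape is fixed
# up front and the slots start out as placeholders of the right size.
empty_toy_anndata <- function(shape) {
  stopifnot(length(shape) == 2)
  out <- new.env()
  out$shape <- as.integer(shape)
  out$obs <- data.frame(row.names = seq_len(shape[1]))
  out$var <- data.frame(row.names = seq_len(shape[2]))
  out
}

# Fill the slots one by one, checking each against the fixed shape
fill_toy_anndata <- function(out, X, obs, var) {
  stopifnot(identical(dim(X), out$shape))
  stopifnot(nrow(obs) == out$shape[1], nrow(var) == out$shape[2])
  out$X <- X
  out$obs <- obs
  out$var <- var
  invisible(out)
}

counts <- matrix(1:6, nrow = 2)  # features x cells, as in an SCE
adata <- empty_toy_anndata(shape = rev(dim(counts)))
fill_toy_anndata(
  adata,
  X = t(counts),
  obs = data.frame(row.names = paste0("c", 1:3)),
  var = data.frame(row.names = paste0("g", 1:2))
)
```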

@mtmorgan does that address your concern?

Sure I'll try to have an implementation later today as part of #59

Likewise with #60 ;)

Hmm, something like class$empty(shape = shape) might not be feasible for the HDF5AnnData, because you immediately need to provide a path for the new file (see #65).

What would make the interface consistent is adding the path argument to each of the create_empty_XXX methods:

create_empty_InMemoryAnnData <- function(shape, path = NULL) {
  stopifnot(is.null(path))
  # ...
}
create_empty_HDF5AnnData <- function(shape, path = NULL) {
  stopifnot(!is.null(path))
  # ...
}

In #72 we decided to use an alternative approach for creating empty AnnData objects from scratch. In essence, functions like from_SingleCellExperiment() and from_Seurat() assume that the interface of every AbstractAnnData class is generator$new(X, obs, var, obs_names, var_names, layers, ...) (perhaps not in that particular order).

For example, I'm changing the interface of the HDF5AnnData to:

    #' @description HDF5AnnData constructor
    #'
    #' @param file The filename (character) of the `.h5ad` file. Alternatively,
    #' you can also provide an object created by `[rhdf5::H5Fopen()]`. If this
    #' file does not exist yet, you must provide values for `X`, `obs`, `var` to
    #' create a new AnnData with.
    #' @param X Either NULL or an observation x variable matrix with
    #'   dimensions consistent with `obs` and `var`.
    #' @param layers Either NULL or a named list, where each element
    #'   is an observation x variable matrix with dimensions consistent
    #'   with `obs` and `var`.
    #' @param obs A `data.frame` with columns containing information
    #'   about observations. The number of rows of `obs` defines the
    #'   observation dimension of the AnnData object.
    #' @param var A `data.frame` with columns containing information
    #'   about variables. The number of rows of `var` defines the variable
    #'   dimension of the AnnData object.
    #' @param obs_names Either NULL or a vector of unique identifiers
    #'   used to identify each row of `obs` and to act as an index into
    #'   the observation dimension of the AnnData object. For
    #'   compatibility with *R* representations, `obs_names` should be a
    #'   character vector.
    #' @param var_names Either NULL or a vector of unique identifiers
    #'   used to identify each row of `var` and to act as an index into
    #'   the variable dimension of the AnnData object. For compatibility
    #'   with *R* representations, `var_names` should be a character
    #'   vector.
    initialize = function(file, X, obs, var, obs_names, var_names, layers) {

The in-memory one can stay as is.
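
The "create the file if it does not exist yet" behaviour described in the `file` parameter could look roughly like the base R sketch below. It is not the actual HDF5AnnData code: `new_toy_file_backed()` is a hypothetical name, `saveRDS()`/`readRDS()` stand in for real HDF5 I/O, and requiring only `obs` and `var` (with `X` allowed to be NULL) is an assumption based on the parameter descriptions.

```r
# Hypothetical constructor logic; saveRDS()/readRDS() stand in for HDF5 I/O.
new_toy_file_backed <- function(file, X = NULL, obs = NULL, var = NULL) {
  if (!file.exists(file)) {
    # New file: the data needed to create it must be supplied
    if (is.null(obs) || is.null(var)) {
      stop("'", file, "' does not exist; provide `obs` and `var` to create it")
    }
    if (!is.null(X)) {
      # X must be observations x variables
      stopifnot(nrow(X) == nrow(obs), ncol(X) == nrow(var))
    }
    saveRDS(list(X = X, obs = obs, var = var), file)
  }
  # Return a handle that reads from disk on demand
  list(file = file, read = function() readRDS(file))
}

path <- tempfile(fileext = ".rds")
obs <- data.frame(batch = c("x", "y"), row.names = c("c1", "c2"))
var <- data.frame(symbol = c("A", "B", "C"), row.names = c("g1", "g2", "g3"))

# First call creates the file from the supplied data
created <- new_toy_file_backed(path, X = matrix(0, 2, 3), obs = obs, var = var)
# Second call only opens it, so no data arguments are needed
reopened <- new_toy_file_backed(path)
```

With a constructor shaped like this, a conversion function can pass `file` alongside `X`, `obs` and `var` and get a file-backed object without a separate "empty" step.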

Closed by #72