Roadmap
Opened this issue · 3 comments
Proposed interface
library(anndataR)
# read from h5ad/h5mu file
adata <- read_h5ad("dataset.h5ad", backend = "HDF5AnnData")
adata <- read_h5ad("dataset.h5ad", backend = "InMemoryAnnData")
# anndata-like interface (the Python package)
adata$X
adata$obs
adata$var
# optional feature 1: S3 helper functions for a base R-like interface
adata[1:10, 2:30]
dim(adata)
dimnames(adata)
as.matrix(adata, layer = NULL)
as.matrix(adata, layer = "counts")
t(adata)
# optional feature 2: S3 helper functions for a bioconductor-like interface
rowData(adata)
colData(adata)
reducedDimNames(adata)
# converters from/to sce
sce <- adata$to_SingleCellExperiment()
from_sce(sce)
# optional feature 3: converters from/to Seurat
seu <- adata$to_Seurat()
from_seurat(seu)
# optional feature 4: converters from/to SOMA
som <- adata$to_SOMA()
from_soma(som)
Class diagram
classDiagram
class AbstractAnnData {
*X: Matrix
*layers: List[Matrix]
*obs: DataFrame
*var: DataFrame
*obsp: List[Matrix]
*varp: List[Matrix]
*obsm: List[Matrix]
*varm: List[Matrix]
*uns: List
*n_obs: int
*n_vars: int
*obs_names: Array[String]
*var_names: Array[String]
*subset(...): AbstractAnnData
*write_h5ad(): Unit
to_SingleCellExperiment(): SingleCellExperiment
to_Seurat(): Seurat
to_HDF5AnnData(): HDF5AnnData
to_ZarrAnnData(): ZarrAnnData
to_InMemoryAnnData(): InMemoryAnnData
}
AbstractAnnData <|-- HDF5AnnData
class HDF5AnnData {
init(h5file): HDF5AnnData
}
AbstractAnnData <|-- ZarrAnnData
class ZarrAnnData {
init(zarrFile): ZarrAnnData
}
AbstractAnnData <|-- InMemoryAnnData
class InMemoryAnnData {
init(X, obs, var, shape, ...): InMemoryAnnData
}
AbstractAnnData <|-- ReticulateAnnData
class ReticulateAnnData {
init(pyobj): ReticulateAnnData
}
class anndataR {
read_h5ad(path, backend): Either[AbstractAnnData, SingleCellExperiment, Seurat]
}
anndataR --> AbstractAnnData
Notation:
X: Matrix
- variableX
is of typeMatrix
*X: Matrix
- variableX
is abstractto_SingleCellExperiment(): SingleCellExperiment
- functionto_SingleCellExperiment
returns object of typeSingleCellExperiment
*to_SingleCellExperiment()
- functionto_SingleCellExperiment
is abstract
OO-framework
S4, RC, or R6?
- S4 offers formal class definitions and multiple dispatch, making it suitable for complex projects, but may be verbose and slower compared to other systems.
- RC provides reference semantics, familiar syntax, and encapsulation, yet it is less popular and can have performance issues.
- R6 presents a simple and efficient OOP system with reference semantics and growing popularity, but lacks multiple dispatch and the formality of S4.
Choosing an OOP system depends on the project requirements, developer familiarity, and desired balance between formality, performance, and ease of use.
Approach
- Implement inheritance objects for
AbstractAnnData
,HDF5AnnData
,InMemoryAnnData
- Only containing
X
,layers
,obs
,var
for now - Implement base R S3 generics
- Implement
read_h5ad()
,$write_h5ad()
- Implement
$to_SingleCellExperiment()
- Add simple unit tests
Optional:
- Add more fields (obsp, obsm, varp, varm, ...) --> see class diagram
- Start implementing MuData
- Implement
$to_Seurat()
- Implement
ZarrAnnData
- Implement
ReticulateAnnData
- Implement Bioconductor S3 generics
Challenges - Previously encountered issues
Below are previously encountered issues when reading h5ad files using hdf5r. They could be
to create test cases.
- mojaveazure/seurat-disk#10: Conversion error with SeuratDisk when copying
uns
- PMBio/MuDataSeurat#8: PCA loadings issue
No test data yet:
- PMBio/MuDataSeurat#14: H5Dvlen_reclaim invalid argument
Roadmap
Should we create a public road map / can I add the following items to the project board?
Before release 0.1.0:
- Add News.md (#57)
- Implement HDF5 base slot setters
- Implement from_SingleCellExperiment (#59)
- Implement from_Seurat (#60)
- Release anndataR 0.1.0
Parallel to current release cycle:
After release 0.1.0:
All sounds good with me. Should we also start to be more disciplined about version number bumps with merges to the main branch? Maybe there's automation for that too...?
Good point. How about we release a version 0.1.0 once support for X, obs, var, obs_names, var_names and layers is implemented and unit tested for the HDF5AnnData, InMemoryAnnData, SeuratConverter and SingleCellExperimented?
From that point on, we add additional functionality by merging PRs into the main branch and adding changelog entries to CHANGELOG.md.
In my opinion we shouldn't merge #50 until the current classes are feature complete. @lazappi @mtmorgan WDYT?
I'll mention https://github.com/neurogenomics/scKirby which I recently became aware of @bschilder. A very cursory look indicates that it could definitely leverage anndataR (it currently uses anndata) when it matures. It also contains 'innovations' like reticulate & basilisk conda-based environments for structured control of Python dependencies.