Harmonize artificial dataset creation

Question

Opened this issue 2 months ago · 1 comment

Is your feature request related to a problem? Please describe.
There are multiple artificial dataset creation functions. It should be clear which ones are most useful and when.

Describe the solution you'd like
Merge or document the different artificial dataset implementations. Ideally, the default implementation and the benchmarking one would be merged, and libraries built on SpatialData could reuse the shared functionality to create more specialized artificial datasets.

Additional context
Here is a list of some implementations:

- spatialdata.datasets.blobs
  - default basic option, slow and limited in use
  - https://github.com/scverse/spatialdata/blob/main/src/spatialdata/datasets.py
- from benchmarks.utils import make_blobs
  - https://github.com/berombau/spatialdata/blob/benchmark-asv/benchmarks/utils.py
  - very fast using adapted code from https://github.com/napari/napari/blob/195bbd0720fce1bae665cd18ccee5456a095b830/napari/benchmarks/utils.py#L175
- SOPA blobs
  - https://github.com/gustaveroussy/sopa/blob/f1f5a99ee7f5a9489e511241a3a62bb520ec9860/sopa/utils/data.py#L188
  - more irregular cell shapes, genes from list
- Harpy cluster_blobs
  - https://github.com/saeyslab/harpy/blob/main/src/sparrow/datasets/cluster_blobs.py
  - multisample, multichannel, ground truth cell type annotation
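
To put numbers on "slow" and "very fast" when comparing these implementations, a small timing helper could be a starting point. This is only a minimal sketch using the standard library: `time_factory` and `rank_factories` are hypothetical names that do not exist in any of the packages above, and each generator is assumed to be wrapped as a zero-argument callable.

```python
from __future__ import annotations

import statistics
import time
from typing import Any, Callable


def time_factory(factory: Callable[[], Any], repeats: int = 3) -> float:
    """Return the median wall-clock seconds over `repeats` calls of `factory`.

    `factory` is any zero-argument callable that builds a dataset
    (hypothetical helper, not part of spatialdata).
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        factory()  # build the dataset and discard the result
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def rank_factories(
    factories: dict[str, Callable[[], Any]], repeats: int = 3
) -> list[tuple[str, float]]:
    """Time every factory and return (name, seconds) pairs, fastest first."""
    timings = {name: time_factory(f, repeats) for name, f in factories.items()}
    return sorted(timings.items(), key=lambda item: item[1])
```

For instance, `rank_factories({"default": spatialdata.datasets.blobs})` would time the default generator with its default arguments. The other implementations take different arguments, so they would need to be wrapped with `functools.partial` or a `lambda` and configured to produce datasets of comparable size before the timings mean much.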

Answer 1 · 2024-11-27T14:56:29.000Z

Thanks for tracking this in an issue! I'd also add that spatialdata.datasets.blobs() is used in a lot of tests, so making it faster would lead to faster testing.