Harmonize artificial dataset creation
berombau commented
Is your feature request related to a problem? Please describe.
There are multiple artificial dataset creation functions. It should be clear which ones are most useful and when.
Describe the solution you'd like
Merge or document the different artificial dataset implementations. Ideally, the default one and the benchmarking one are merged, and the implementations in libraries built on SpatialData can reuse some of that functionality to create more specialized artificial datasets.
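A rough sketch of what reusing a shared core could look like. All names here (`Blob`, `make_base_blobs`, `rasterize_labels`, `make_annotated_blobs`) are hypothetical and are not part of spatialdata, SOPA, or Harpy. It uses only the standard library to stay self-contained, so it says nothing about how the real implementations work:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    """A circular blob in pixel coordinates (hypothetical type)."""
    x: float
    y: float
    radius: float


def make_base_blobs(n_blobs=5, size=64, seed=0):
    """Hypothetical shared core: place n_blobs random circles on a size x size canvas."""
    rng = random.Random(seed)
    return [
        Blob(rng.uniform(0, size), rng.uniform(0, size), rng.uniform(2, size / 8))
        for _ in range(n_blobs)
    ]


def rasterize_labels(blobs, size=64):
    """Turn blobs into a label mask: 0 is background, i + 1 marks blob i.

    Where blobs overlap, the later blob overwrites the earlier one.
    """
    labels = [[0] * size for _ in range(size)]
    for idx, b in enumerate(blobs, start=1):
        # Only visit the bounding box of the circle, clipped to the canvas.
        y_lo, y_hi = max(0, int(b.y - b.radius)), min(size, int(b.y + b.radius) + 1)
        x_lo, x_hi = max(0, int(b.x - b.radius)), min(size, int(b.x + b.radius) + 1)
        for row in range(y_lo, y_hi):
            for col in range(x_lo, x_hi):
                if (col - b.x) ** 2 + (row - b.y) ** 2 <= b.radius ** 2:
                    labels[row][col] = idx
    return labels


def make_annotated_blobs(cell_types=("A", "B"), **kwargs):
    """Hypothetical specialized generator: reuses the core, adds ground-truth cell types."""
    blobs = make_base_blobs(**kwargs)
    rng = random.Random(kwargs.get("seed", 0) + 1)
    return blobs, [rng.choice(cell_types) for _ in blobs]
```

The idea is that a downstream library would only implement the part that is specific to it (here, the cell type annotation) and call the shared core for everything else, so a speedup in the core benefits every generator.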
Additional context
Here is a list of some implementations:
- spatialdata.datasets.blobs
  - default basic option, slow and limited in use
  - https://github.com/scverse/spatialdata/blob/main/src/spatialdata/datasets.py
- from benchmarks.utils import make_blobs
- SOPA blobs
  - https://github.com/gustaveroussy/sopa/blob/f1f5a99ee7f5a9489e511241a3a62bb520ec9860/sopa/utils/data.py#L188
  - more irregular cell shapes, genes from list
- Harpy cluster_blobs
  - https://github.com/saeyslab/harpy/blob/main/src/sparrow/datasets/cluster_blobs.py
  - multisample, multichannel, ground truth cell type annotation
LucaMarconato commented
Thanks for tracking this in an issue! I'd also add that spatialdata.datasets.blobs() is used in a lot of tests, so making it faster would lead to faster testing.