BiocPy/GenomicRanges

Improve generator function consistency with R packages

Closed this issue · 3 comments

Hi @jkanche,

I started digging into BiocPy this week and am really enjoying the suite of packages. Awesome work. One minor thing I noticed is that the generator functions don't currently work without required arguments, which differs from the R packages:

library(GenomicRanges)
GRanges()
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

library(SummarizedExperiment)
SummarizedExperiment()
class: SummarizedExperiment
dim: 0 0
metadata(0):
assays(0):
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):

library(SingleCellExperiment)
SingleCellExperiment()
class: SingleCellExperiment
dim: 0 0
metadata(0):
assays(0):
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

library(MultiAssayExperiment)
MultiAssayExperiment()
A MultiAssayExperiment object of 0 listed
 experiments with no user-defined names and respective classes.
 Containing an ExperimentList class object of length 0:
 Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

In comparison, here's what we currently see with the Python generator functions:

The GenomicRanges generator currently requires seqnames and ranges.

from biocpy.genomicranges import GenomicRanges
GenomicRanges()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: GenomicRanges.__init__() missing 2 required positional arguments: 'seqnames' and 'ranges'

SummarizedExperiment and SingleCellExperiment currently require assays.

from biocpy.summarizedexperiment import SummarizedExperiment
SummarizedExperiment()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: SummarizedExperiment.__init__() missing 1 required positional argument: 'assays'

MultiAssayExperiment currently requires experiments, col_data, and sample_map.

from biocpy.multiassayexperiment import MultiAssayExperiment
MultiAssayExperiment()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: MultiAssayExperiment.__init__() missing 3 required positional arguments: 'experiments', 'col_data', and 'sample_map'
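To make the request concrete, here is a minimal sketch of a constructor whose arguments default to "no data", so that calling it with no arguments yields an empty object as `GRanges()` does in R. `ToyRanges` and its fields are hypothetical stand-ins for illustration only; this is not the BiocPy implementation.

```python
from typing import List, Optional, Tuple


class ToyRanges:
    """Hypothetical stand-in for a ranges container; not the BiocPy class."""

    def __init__(
        self,
        seqnames: Optional[List[str]] = None,
        ranges: Optional[List[Tuple[int, int]]] = None,
    ):
        # Treat a missing argument as "no data" instead of raising TypeError.
        self.seqnames = list(seqnames) if seqnames is not None else []
        self.ranges = list(ranges) if ranges is not None else []
        # Still validate whatever was supplied.
        if len(self.seqnames) != len(self.ranges):
            raise ValueError("seqnames and ranges must have the same length")

    def __len__(self) -> int:
        return len(self.ranges)

    def __repr__(self) -> str:
        return f"ToyRanges with {len(self)} ranges"


empty = ToyRanges()  # no arguments needed
one = ToyRanges(seqnames=["chr1"], ranges=[(1, 10)])
```

Using `None` as the default, rather than a mutable default such as `[]`, avoids sharing one list between instances.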

Also, what are your thoughts on using "GRanges" instead of "GenomicRanges" for the primary generator name? This would better match current conventions in the R package. Alternatively, we could just add this as an alias to the GenomicRanges generator function.
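The alias option would only take one line. A minimal sketch, using a hypothetical placeholder class in place of the real `GenomicRanges` (whether the package should export such an alias is up to the maintainers):

```python
class GenomicRanges:
    """Hypothetical placeholder; stands in for the real class."""

    def __init__(self, seqnames=None, ranges=None):
        self.seqnames = [] if seqnames is None else list(seqnames)
        self.ranges = [] if ranges is None else list(ranges)


# Module-level alias: both names are bound to the same class object,
# so isinstance checks and documentation stay in sync.
GRanges = GenomicRanges

gr = GRanges()
```

Because the alias is the same object rather than a subclass, `type(gr)` still reports `GenomicRanges`.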

Best,
Mike

Also, what's the current recommended method for saving objects to disk? Should we be using pickle?

Hi @mjsteinbaugh, Thank you for the feedback.

We provide an empty() method in these classes that implements similar behavior. We are currently rewriting some of the classes to follow a functional paradigm and can revisit the default values of the constructor arguments. For example:

> ~/P/p/b/GenomicRanges on master ◦ python
Python 3.9.17 (main, Jul  5 2023, 15:35:09)
[Clang 14.0.6 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> from genomicranges import GenomicRanges
>>> GenomicRanges.empty()
GenomicRanges(number_of_ranges=0, seqnames=[], 
ranges=IRanges(start=array([], dtype=int32), width=array([], dtype=int32)), 
strand=[], 
mcols=BiocFrame(data={}, number_of_rows=0, column_names=[]), 
seqinfo<genomicranges.SeqInfo.SeqInfo object at 0x100d70d30>)
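For readers unfamiliar with the pattern, `empty()` here acts as an alternate constructor. The sketch below shows the general idea with a hypothetical `ToyExperiment` class; it is an illustration of the classmethod approach, not a copy of how GenomicRanges implements it.

```python
class ToyExperiment:
    """Hypothetical container whose constructor requires its data."""

    def __init__(self, assays: dict):
        self.assays = dict(assays)

    @classmethod
    def empty(cls) -> "ToyExperiment":
        # Alternate constructor: supplies the "no data" arguments itself,
        # so the regular constructor can keep its required parameters.
        return cls(assays={})

    def __len__(self) -> int:
        return len(self.assays)


blank = ToyExperiment.empty()
```

With this design, `ToyExperiment()` still raises a `TypeError`, while `ToyExperiment.empty()` makes the intent to create an empty object explicit.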

> Also, what's the current recommended method for saving objects to disk? Should we be using pickle?

Regarding saving objects to disk, we are actively developing ArtifactDB, a system for storing genomic datasets in language-agnostic formats. Within ArtifactDB, the dolomite package is designed for saving Python objects. Instead of relying on pickle/RDS files, one can use these formats to store, share, or access datasets across programming languages. We plan to release a comprehensive tutorial on these features once we reach a stable release. In the meantime, pickle is the way to go.
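As a rough illustration of the pickle route, here is a round trip using only the standard library. The dictionary is a hypothetical stand-in for a BiocPy object, and `save_object`/`load_object` are illustrative helper names, not part of any BiocPy package; the same calls should work for any object that supports pickling.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load a pickled object. Only unpickle files from sources you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in data; replace with the object you want to persist.
original = {"seqnames": ["chr1", "chr2"], "start": [1, 100], "width": [10, 20]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ranges.pkl"
    save_object(original, path)
    restored = load_object(path)
```

Pickle files are Python-specific and can break when class definitions change between package versions, which is part of the motivation for language-agnostic formats.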

Thanks @jkanche, I'll check this out. Much appreciated!