Improve generator function consistency with R packages

Question

I started digging into BiocPy this week and am really enjoying the suite of packages. Awesome work. One minor thing I noticed is that the generator functions don't currently work without required arguments, which differs from the R packages:

library(GenomicRanges)
GRanges()

GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

library(SummarizedExperiment)
SummarizedExperiment()

class: SummarizedExperiment
dim: 0 0
metadata(0):
assays(0):
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):

library(SingleCellExperiment)
SingleCellExperiment()

class: SingleCellExperiment
dim: 0 0
metadata(0):
assays(0):
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

library(MultiAssayExperiment)
MultiAssayExperiment()

A MultiAssayExperiment object of 0 listed
 experiments with no user-defined names and respective classes.
 Containing an ExperimentList class object of length 0:
 Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

In comparison, here's what we currently see with the Python generator functions:

The GenomicRanges generator currently requires seqnames and ranges.

from biocpy.genomicranges import GenomicRanges
GenomicRanges()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: GenomicRanges.__init__() missing 2 required positional arguments: 'seqnames' and 'ranges'

SummarizedExperiment and SingleCellExperiment currently require assays.

from biocpy.summarizedexperiment import SummarizedExperiment
SummarizedExperiment()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: SummarizedExperiment.__init__() missing 1 required positional argument: 'assays'

MultiAssayExperiment currently requires experiments, col_data, and sample_map.

from biocpy.multiassayexperiment import MultiAssayExperiment
MultiAssayExperiment()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: MultiAssayExperiment.__init__() missing 3 required positional arguments: 'experiments', 'col_data', and 'sample_map'

Also, what are your thoughts on using "GRanges" instead of "GenomicRanges" for the primary generator name? This would better match current conventions in the R package. Alternatively, we could just add this as an alias to the GenomicRanges generator function.
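To make the alias idea concrete, here is a minimal sketch. The class below is a hypothetical stand-in written only for illustration; it is not the real BiocPy class and its signature is an assumption.

```python
class GenomicRanges:
    """Hypothetical stand-in; the real BiocPy class has a different signature."""

    def __init__(self, seqnames, ranges):
        self.seqnames = list(seqnames)
        self.ranges = list(ranges)


# Assumption: the alias is a plain module-level name bound to the same class,
# so isinstance checks and subclassing behave identically under either name.
GRanges = GenomicRanges

gr = GRanges(seqnames=["chr1"], ranges=[(1, 10)])
print(GRanges is GenomicRanges)       # True
print(isinstance(gr, GenomicRanges))  # True
```

One consequence of this approach is that the class still reports its name as GenomicRanges in reprs and error messages, since an alias only adds a second name for the same object.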

Best,
Mike

Answer 1 · 2023-12-07T14:46:57.000Z

Also, what's the current recommended method for saving objects to disk? Should we be using pickle?

Answer 2 · 2023-12-08T17:05:10.000Z

Hi @mjsteinbaugh, thank you for the feedback.

We provide an empty() method in these classes that implements similar behavior. We are currently rewriting some of the classes to follow a functional paradigm, and can revisit the default values of the constructor arguments as part of that work. For example:

⋊> ~/P/p/b/GenomicRanges on master ◦ python                             
Python 3.9.17 (main, Jul  5 2023, 15:35:09)
[Clang 14.0.6 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> from genomicranges import GenomicRanges
>>> GenomicRanges.empty()
GenomicRanges(number_of_ranges=0, seqnames=[], 
ranges=IRanges(start=array([], dtype=int32), width=array([], dtype=int32)), 
strand=[], 
mcols=BiocFrame(data={}, number_of_rows=0, column_names=[]), 
seqinfo<genomicranges.SeqInfo.SeqInfo object at 0x100d70d30>)
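As a rough sketch of what revisiting the constructor defaults could look like, the toy class below accepts a bare call and keeps an empty() classmethod that simply delegates to it. ToyRanges, its starts argument, and its validation rule are hypothetical names invented for this example; none of this is the actual GenomicRanges implementation.

```python
from typing import Optional, Sequence


class ToyRanges:
    """Hypothetical stand-in for a range container, not the real GenomicRanges."""

    def __init__(
        self,
        seqnames: Optional[Sequence[str]] = None,
        starts: Optional[Sequence[int]] = None,
    ):
        # None means "not supplied"; this avoids sharing a mutable default list.
        self.seqnames = list(seqnames) if seqnames is not None else []
        self.starts = list(starts) if starts is not None else []
        if len(self.seqnames) != len(self.starts):
            raise ValueError("seqnames and starts must have the same length")

    @classmethod
    def empty(cls) -> "ToyRanges":
        # Equivalent to calling the constructor with no arguments.
        return cls()

    def __len__(self) -> int:
        return len(self.seqnames)


print(len(ToyRanges()))        # 0
print(len(ToyRanges.empty()))  # 0
```

With defaults like these, a no-argument call and empty() produce the same zero-length object, which is the behavior the R constructors show.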

Also, what's the current recommended method for saving objects to disk? Should we be using pickle?

Regarding saving objects to disk: we are actively developing ArtifactDB, a system for storing genomic datasets in language-agnostic formats. Within ArtifactDB, the dolomite package is designed for saving Python objects. These formats can be used instead of pickle/RDS files to store, share, or access datasets across programming languages. We plan to release a comprehensive tutorial on these features once we reach a stable release. In the meantime, pickle is the way to go.
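For the interim pickle route, a minimal round-trip sketch using only the standard library might look like this. The save_object and load_object helpers are hypothetical names, and a plain dictionary stands in for a BiocPy object so that the example runs without any package installed.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Serialize obj to path using the highest available pickle protocol."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load a pickled object. Only unpickle files from sources you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in data; a real session would pass the experiment or ranges object.
toy = {"seqnames": ["chr1", "chr2"], "starts": [1, 100]}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "toy.pkl"
    save_object(toy, target)
    restored = load_object(target)

print(restored == toy)  # True
```

Pickle files are Python-specific and depend on the class definitions available at load time, so a file written with one version of a package may not load under another.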

Answer 3 · 2023-12-08T18:58:31.000Z

Thanks @jkanche, I'll check this out. Much appreciated!