BiocPy/GenomicRanges

Should `GenomicRanges` and `GenomicRangesList` check that the row_names of `mcols` is the same as the name of the ranges in the parent object?

Opened this issue · 4 comments

Currently, GenomicRanges and GenomicRangesList only checks that the mcols BiocFrame has the same number of rows as there are ranges. However, it seems prudent to also ensure that mcols has the same row_names as the names of the ranges in the parent object. The lack of this guarantee makes it impossible to select mcols rows from a subset of ranges rows and many other operations. This also applies to the names of the IRanges object in GenomicRanges.

If I understand correctly, R does not enforce this either. Users are allowed to modify row names, and there are no built-in checks to ensure that the row names of IRanges and GenomicRanges objects match. It is assumed that there is a 1-to-1 index mapping between these two structures.

I did not look at the Bioconductor code code before this. What I see is that GRanges prevents mcols from having rownames: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L55. Probably something that BiocPy should also enforce? Also the ranges is prevented from having mcols but I am unsure if that is relevant: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L48

Weirdly, it doesn't seem to prevent creation of a genomic ranges object. It seems to elevate the mcols of iranges and merge it with genomicranges. I think this is confusing when objects magically change things and I prefer explicit construction of these objects.

> gr <- GRanges(
+   seqnames = c("chr1", "chr1"), 
+   ranges = IRanges(c(2, 9), width=c(6, 11), mcols=DataFrame(a = 1:2)),
+   strand = c("+", "-"))
> gr
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |   mcols.a
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1       2-7      + |         1
  [2]     chr1      9-19      - |         2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> ranges(gr)
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         2         7         6
  [2]         9        19        11

Transformation stuff happens here: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L187 through line 197.

I don't disagree, its just handling these differences when trying to read from or write to disk is very challenging.