Should `GenomicRanges` and `GenomicRangesList` check that the row_names of `mcols` is the same as the name of the ranges in the parent object?
Opened this issue · 4 comments
Currently, GenomicRanges
and GenomicRangesList
only checks that the mcols
BiocFrame
has the same number of rows as there are ranges. However, it seems prudent to also ensure that mcols
has the same row_names
as the names of the ranges in the parent object. The lack of this guarantee makes it impossible to select mcols rows from a subset of ranges rows and many other operations. This also applies to the names of the IRanges
object in GenomicRanges
.
If I understand correctly, R does not enforce this either. Users are allowed to modify row names, and there are no built-in checks to ensure that the row names of IRanges and GenomicRanges objects match. It is assumed that there is a 1-to-1 index mapping between these two structures.
I did not look at the Bioconductor code code before this. What I see is that GRanges
prevents mcols
from having rownames
: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L55. Probably something that BiocPy
should also enforce? Also the ranges
is prevented from having mcols
but I am unsure if that is relevant: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L48
Weirdly, it doesn't seem to prevent creation of a genomic ranges object. It seems to elevate the mcols of iranges and merge it with genomicranges. I think this is confusing when objects magically change things and I prefer explicit construction of these objects.
> gr <- GRanges(
+ seqnames = c("chr1", "chr1"),
+ ranges = IRanges(c(2, 9), width=c(6, 11), mcols=DataFrame(a = 1:2)),
+ strand = c("+", "-"))
> gr
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | mcols.a
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 2-7 + | 1
[2] chr1 9-19 - | 2
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> ranges(gr)
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 2 7 6
[2] 9 19 11
Transformation stuff happens here: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L187 through line 197.
I don't disagree, its just handling these differences when trying to read from or write to disk is very challenging.