is this an expected `gaps` behaviour?
oleksii-nikolaienko opened this issue · 7 comments
Hi,
I wasn't sure if gaps
is expected to create non-empty GRanges
object from an empty input object. Is the result (containing 2133 genomic ranges) correct?
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg38)
gaps(GRanges(seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38)))
I'd rather expect that only the seqnames
will be complemented, not the entire list of seqlevels
. Manual says "gaps
returns ... complemented ranges for each distinct (seqname, strand) pairing", which I tend to interpret as ranges for available seqnames
only.
Sorry if I misunderstood this.
Good question.
Short answer: Yes this is the expected behavior on a GRanges object when the sequence lengths are known.
Detailed answer:
gaps()
behaves differently dependending on whether the seqlengths in its seqinfo are specified or not.
When the seqlengths are not specified:
library(GenomicRanges)
seqinfo <- Seqinfo(c("chr1", "chr2", "chr3"))
seqinfo
# Seqinfo object with 3 sequences from an unspecified genome; no seqlengths:
# seqnames seqlengths isCircular genome
# chr1 NA NA <NA>
# chr2 NA NA <NA>
# chr3 NA NA <NA>
gr <- GRanges(c("chr1: 281-350", "chr3:55-75"), seqinfo=seqinfo)
gr
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 281-350 *
# [2] chr3 55-75 *
# -------
# seqinfo: 3 sequences from an unspecified genome; no seqlengths
gaps(gr)
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-280 *
# [2] chr3 1-54 *
# -------
# seqinfo: 3 sequences from an unspecified genome; no seqlengths
However when the seqlengths are specified, things are quite different. This is because in this case the reference space on top of which the ranges are defined is precisely known. Taking advantage of that knowledge, gaps(gr)
is able to return the regions in that reference space that are not covered by gr
:
seqlengths(gr) <- c(360, 220, 100)
seqinfo(gr)
# Seqinfo object with 3 sequences from an unspecified genome:
# seqnames seqlengths isCircular genome
# chr1 360 NA <NA>
# chr2 220 NA <NA>
# chr3 100 NA <NA>
uncovered_regions <- gaps(gr)
uncovered_regions
# GRanges object with 11 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-360 +
# [2] chr1 1-360 -
# [3] chr1 1-280 *
# [4] chr1 351-360 *
# [5] chr2 1-220 +
# [6] chr2 1-220 -
# [7] chr2 1-220 *
# [8] chr3 1-100 +
# [9] chr3 1-100 -
# [10] chr3 1-54 *
# [11] chr3 76-100 *
# -------
# seqinfo: 3 sequences from an unspecified genome
As you can see, in this case, gaps()
returns the ranges of the uncovered regions not only for all the reference sequences, but also for each possible strand value of each reference sequence. The reason for reporting the uncovered regions for strand values +
/-
/*
separately is because this information actually matters when the input GRanges object is stranded i.e. when it has ranges defined on the +
and/or -
strand. However, in the case of an unstranded GRanges object like in our example above, this is admitedly of limited value and possibly confusing and not convenient to work with.
The good news is that it's easy to fix:
uncovered_regions <- uncovered_regions[strand(uncovered_regions) == "*"]
uncovered_regions
# GRanges object with 4 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-280 *
# [2] chr1 351-360 *
# [3] chr2 1-220 *
# [4] chr3 1-100 *
# -------
# seqinfo: 3 sequences from an unspecified genome
Now back to your original question. Why is the chr2:1-220
range part of the result if the input GRanges object has no range on the chr2
sequence? So again, this is because gaps()
wants to report all the regions uncovered by the input GRanges object. So chr2:1-220
, which is the region that corresponds to the full chr2
sequence, has to be reported. If it was not, then it would mean that the full chr2
sequence is covered by the original GRanges object, like here:
gr2 <- c(gr, GRanges(c("chr2:101-220", "chr2:1-199")))
gr2
# GRanges object with 4 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 281-350 *
# [2] chr3 55-75 -
# [3] chr2 101-220 *
# [4] chr2 1-199 *
# -------
# seqinfo: 3 sequences from an unspecified genome
uncovered_regions2 <- gaps(gr2)
uncovered_regions2 <- uncovered_regions2[strand(uncovered_regions2) == "*"]
uncovered_regions2
# GRanges object with 3 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-280 *
# [2] chr1 351-360 *
# [3] chr3 1-100 *
# -------
# seqinfo: 3 sequences from an unspecified genome
Finally, note that an important property of the gaps()
operation is that one should be able to go back to the original GRanges by applying the operation twice. In other words, gaps(gaps(gr))
should bring back gr
. This is possible when the seqlengths are known but not if they are not known. This property drove the current behavior and implementation of the gaps()
methods.
Hope this helps,
H.
Thanks for an explanation, @hpages, this behavior makes more sense now.
My task was to get all restriction fragments, chromosome by chromosome. And so I noticed that the output depends on when I apply gaps, e.g.
cuts.chr21 <- unlist(matchPDict(PDict("CCGG", algorithm="Twobit"), BSgenome.Hsapiens.UCSC.hg38[["chr21"]]))
cuts.chr22 <- unlist(matchPDict(PDict("CCGG", algorithm="Twobit"), BSgenome.Hsapiens.UCSC.hg38[["chr22"]]))
all.cuts <- c(
GRanges(seqnames="chr21", ranges=cuts.chr21, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38)),
GRanges(seqnames="chr22", ranges=cuts.chr22, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38))
)
all.gaps <- gaps(all.cuts)
# or
all.gaps <- c(
gaps(GRanges(seqnames="chr21", ranges=cuts.chr21, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38))),
gaps(GRanges(seqnames="chr22", ranges=cuts.chr22, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38)))
)
I see. Just a note that if you're only matching a single pattern then you should use matchPattern()
instead of matchPDict()
. The former is going to be about 3x faster in this case.
Thanks. No, the real code has multiple patterns and does it for all contigs, etc
Also, for your use case above, I would do:
si <- seqinfo(BSgenome.Hsapiens.UCSC.hg38)
gaps21 <- gaps(GRanges(seqnames="chr21", ranges=cuts.chr21, seqinfo=si["chr21"]))
gaps21 <- gaps21[strand(gaps21) == "*"]
length(gaps21) # 35918 gaps on chr21
gaps22 <- gaps(GRanges(seqnames="chr22", ranges=cuts.chr22, seqinfo=si["chr22"]))
gaps22 <- gaps22[strand(gaps22) == "*"]
length(gaps22) # 59202 gaps on chr22
all.gaps <- c(gaps21, gaps22)
Cheers,
H.
@oleksii-nikolaienko FYI I just added the ignore.strand
argument to the gaps()
method for GenomicRanges objects (see commit 6ba13b5). This is in GenomicRanges 1.53.3, which should become available via BiocManager::install()
in the next 24 hrs or so. Note that it will only become available to BioC 3.18 users (BioC 3.18 is not released yet, but it will be next week).
Thanks, Hervé!