Bioconductor/GenomicRanges

is this an expected `gaps` behaviour?

oleksii-nikolaienko opened this issue · 7 comments

Hi,
I wasn't sure if gaps is expected to create non-empty GRanges object from an empty input object. Is the result (containing 2133 genomic ranges) correct?

library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg38)

gaps(GRanges(seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38)))

I'd rather expect that only the seqnames will be complemented, not the entire list of seqlevels. Manual says "gaps returns ... complemented ranges for each distinct (seqname, strand) pairing", which I tend to interpret as ranges for available seqnames only.
Sorry if I misunderstood this.

hpages commented

Good question.

Short answer: Yes this is the expected behavior on a GRanges object when the sequence lengths are known.

Detailed answer:

gaps() behaves differently dependending on whether the seqlengths in its seqinfo are specified or not.

When the seqlengths are not specified:

library(GenomicRanges)

seqinfo <- Seqinfo(c("chr1", "chr2", "chr3"))
seqinfo
# Seqinfo object with 3 sequences from an unspecified genome; no seqlengths:
#   seqnames seqlengths isCircular genome
#   chr1             NA         NA   <NA>
#   chr2             NA         NA   <NA>
#   chr3             NA         NA   <NA>

gr <- GRanges(c("chr1: 281-350", "chr3:55-75"), seqinfo=seqinfo)
gr
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1   281-350      *
#   [2]     chr3     55-75      *
#   -------
#   seqinfo: 3 sequences from an unspecified genome; no seqlengths

gaps(gr)
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1     1-280      *
#   [2]     chr3      1-54      *
#   -------
#   seqinfo: 3 sequences from an unspecified genome; no seqlengths

However when the seqlengths are specified, things are quite different. This is because in this case the reference space on top of which the ranges are defined is precisely known. Taking advantage of that knowledge, gaps(gr) is able to return the regions in that reference space that are not covered by gr:

seqlengths(gr) <- c(360, 220, 100)
seqinfo(gr)
# Seqinfo object with 3 sequences from an unspecified genome:
#   seqnames seqlengths isCircular genome
#   chr1            360         NA   <NA>
#   chr2            220         NA   <NA>
#   chr3            100         NA   <NA>

uncovered_regions <- gaps(gr)
uncovered_regions
# GRanges object with 11 ranges and 0 metadata columns:
#        seqnames    ranges strand
#           <Rle> <IRanges>  <Rle>
#    [1]     chr1     1-360      +
#    [2]     chr1     1-360      -
#    [3]     chr1     1-280      *
#    [4]     chr1   351-360      *
#    [5]     chr2     1-220      +
#    [6]     chr2     1-220      -
#    [7]     chr2     1-220      *
#    [8]     chr3     1-100      +
#    [9]     chr3     1-100      -
#   [10]     chr3      1-54      *
#   [11]     chr3    76-100      *
#   -------
#   seqinfo: 3 sequences from an unspecified genome

As you can see, in this case, gaps() returns the ranges of the uncovered regions not only for all the reference sequences, but also for each possible strand value of each reference sequence. The reason for reporting the uncovered regions for strand values +/-/* separately is because this information actually matters when the input GRanges object is stranded i.e. when it has ranges defined on the + and/or - strand. However, in the case of an unstranded GRanges object like in our example above, this is admitedly of limited value and possibly confusing and not convenient to work with.

The good news is that it's easy to fix:

uncovered_regions <- uncovered_regions[strand(uncovered_regions) == "*"]
uncovered_regions
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1     1-280      *
#   [2]     chr1   351-360      *
#   [3]     chr2     1-220      *
#   [4]     chr3     1-100      *
#   -------
#   seqinfo: 3 sequences from an unspecified genome

Now back to your original question. Why is the chr2:1-220 range part of the result if the input GRanges object has no range on the chr2 sequence? So again, this is because gaps() wants to report all the regions uncovered by the input GRanges object. So chr2:1-220, which is the region that corresponds to the full chr2 sequence, has to be reported. If it was not, then it would mean that the full chr2 sequence is covered by the original GRanges object, like here:

gr2 <- c(gr, GRanges(c("chr2:101-220", "chr2:1-199")))
gr2
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1   281-350      *
#   [2]     chr3     55-75      -
#   [3]     chr2   101-220      *
#   [4]     chr2     1-199      *
#   -------
#   seqinfo: 3 sequences from an unspecified genome

uncovered_regions2 <- gaps(gr2)
uncovered_regions2 <- uncovered_regions2[strand(uncovered_regions2) == "*"]
uncovered_regions2
# GRanges object with 3 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1     1-280      *
#   [2]     chr1   351-360      *
#   [3]     chr3     1-100      *
#   -------
#   seqinfo: 3 sequences from an unspecified genome

Finally, note that an important property of the gaps() operation is that one should be able to go back to the original GRanges by applying the operation twice. In other words, gaps(gaps(gr)) should bring back gr. This is possible when the seqlengths are known but not if they are not known. This property drove the current behavior and implementation of the gaps() methods.

Hope this helps,
H.

Thanks for an explanation, @hpages, this behavior makes more sense now.

My task was to get all restriction fragments, chromosome by chromosome. And so I noticed that the output depends on when I apply gaps, e.g.

cuts.chr21 <- unlist(matchPDict(PDict("CCGG", algorithm="Twobit"), BSgenome.Hsapiens.UCSC.hg38[["chr21"]]))
cuts.chr22 <- unlist(matchPDict(PDict("CCGG", algorithm="Twobit"), BSgenome.Hsapiens.UCSC.hg38[["chr22"]]))

all.cuts <- c(
  GRanges(seqnames="chr21", ranges=cuts.chr21, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38)),
  GRanges(seqnames="chr22", ranges=cuts.chr22, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38))
)
all.gaps <- gaps(all.cuts)

# or

all.gaps <- c(
  gaps(GRanges(seqnames="chr21", ranges=cuts.chr21, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38))),
  gaps(GRanges(seqnames="chr22", ranges=cuts.chr22, seqinfo=seqinfo(BSgenome.Hsapiens.UCSC.hg38)))
)
hpages commented

I see. Just a note that if you're only matching a single pattern then you should use matchPattern() instead of matchPDict(). The former is going to be about 3x faster in this case.

Thanks. No, the real code has multiple patterns and does it for all contigs, etc

hpages commented

Also, for your use case above, I would do:

si <- seqinfo(BSgenome.Hsapiens.UCSC.hg38)

gaps21 <- gaps(GRanges(seqnames="chr21", ranges=cuts.chr21, seqinfo=si["chr21"]))
gaps21 <- gaps21[strand(gaps21) == "*"]
length(gaps21)  # 35918 gaps on chr21

gaps22 <- gaps(GRanges(seqnames="chr22", ranges=cuts.chr22, seqinfo=si["chr22"]))
gaps22 <- gaps22[strand(gaps22) == "*"]
length(gaps22)  # 59202 gaps on chr22

all.gaps <- c(gaps21, gaps22)

Cheers,
H.

hpages commented

@oleksii-nikolaienko FYI I just added the ignore.strand argument to the gaps() method for GenomicRanges objects (see commit 6ba13b5). This is in GenomicRanges 1.53.3, which should become available via BiocManager::install() in the next 24 hrs or so. Note that it will only become available to BioC 3.18 users (BioC 3.18 is not released yet, but it will be next week).

Thanks, Hervé!