chaimleib/intervaltree

support for null intervals

Closed this issue · 5 comments

it should be nice to support null intervals, like [4, 4]

There have been many requests to include points within the range search. I tried doing this once, but the algorithms for traversing and modifying the tree really depend on intervals containing some range. So, if you would like to include 4 in your interval tree, you could add the interval [4, 5), which would start at 4 inclusively and end at 5 without including it, which is how all Intervals work in the IntervalTree. They include the start value, but not the end value.

This behavior is analogous to string or array indexing:

>>> s = "abcdef"
>>> s[4:5]
"e"
>>> s[4:4]
""

>>> a = [0, 1, 2, 3, 4, 5]
>>> a[4:5]
[4]
>>> a[4:4]
[]

If your range limits are not restricted to discrete values, for example, they can be reals with fractional parts, then for now, I would suggest coding around sortedcontainers.SortedDict. This is how I would like to add point support in a future release, but to a new class with a different name, since this is different in behavior from the current interval tree.

in the case where left is included and right is excluded a position described as i:i would refer to the location between i-1 and i, not including either i-1 or i themselves.

to give context for this, a description of an insertion mutation in a genome would refer to a position between nucleotides as the location of origin for the variant, because that "range" doesn't exist in the reference sequence, but rather has a 'location' that lies between existing positions

such a position would NOT be encompassed by ranges that touch either side of it, but would be if they crossed over it.

lets take position 3:3 for instance. It is a "between" position that lies in the potential insertion gap between 2 and 3, not including 2 OR 3. It would NOT overlap with either 1:3 (which technically ends at 2 if we're dealing only with integers) or even with 3:6 (because this range begins at 3 itself, not the gap between 2 and 3).

2:3 would overlap it, 3:3 explicitly would as well.

In order to support using floats and other non-integer ranges, Intervals include the lower bound, but not the upper bound, so Interval(2, 3) is 2 ≤ x < 3. This is a deep dependency in the search, insertion and balancing algorithms, and other methods of IntervalTree.

If you tried to create Interval(3, 3), you would get 3 ≤ x < 3. There is no x that satisfies this inequality, since x = 3 won't satisfy x < 3. The interval includes no points, and is equivalent to the null set.

If I were to make a new kind of tree where point searches did not include boundaries, weird and awkward behavior would result. For example, if I asked it "What events do I have at 3 o'clock?", an event beginning at 3 would not appear.

There do exist other libraries that support all types of intervals, inclusive or not inclusive of either the lower or upper bound. However, the added flexibility results in a very complicated API and even more complicated internal logic for dealing with intervals. I opted to only support half-open intervals, as defined above.

It is useful for me to have null intervals in the context I am using the intervaltree. I am making an interval tree of CIDR format IP addresses. 1.0.254.0/23 for example. There is a degenerate case of 1.0.254.0/32 that boils down to a single IP address rather than a range.

I did read your first reply and bumped the upper range of the degenerate case to be [lower:lower+1] to set the interval.

Handling of null intervals is a better fit conceptually where the intervals represent unit positions in data sets.
My use-case is in natural language processing where sequences of words in text are labelled. Often you want to assess if annotations in running text overlap. Any text dataset has a lot of one-word length annotations. [Word spans] are typically represented by intervals of word positions, e.g. [0,1] = [Word spans]. As mentioned, you can transform the positional representation to boundary representation by "end_position + 1" in combination with exclusive right boundary matching.
However the programmer has to always be cognizant of this representative transformation.

I used to use Sympy for text problems because it is a better fit conceptually as it casts null intervals to FiniteSet objects. But Sympy is way too slow and memory intensive on large datasets. Intervaltree.py is fast and light, so I will keep using this package over Sympy.