/interval_dict

Associative array interval library for c++17/20.

Primary LanguageC++Boost Software License 1.0BSL-1.0

interval_dict

An associative array / interval library for c++17/20:

  • Key / value lookup dictionary (like std::map etc.)
  • Associations are only valid over specific intervals. Useful for building time-varying lookup dictionaries.

Overview:

Interval Dictionaries are an extension of the associative containers that allow efficient lookup of Key→Values that are valid only over a specified interval.

Two stage lookup:

  • Stage 1: Search, removal, and insertion operations for key → interval/values have logarithmic complexity (std::map under the hood).
  • Stage 2: Values specified by query interval: Search operations have logarithmic complexity. Different implementations have different complexity for insertions and deletions.

Documentation:

Woefully incomplete documentation here. Better documentation coming soon!

Special features:

1. Chaining lookups

interval dict dictionaries can be "chained" together:

Dictionaries of AB and BC can be combined to a dictionary of AC. The associations will take into account the overlapping intervals for AB and BC

2. Filling interval gaps

Missing data can produce breaks / gaps in associations.

For example, data for companywebsite may have gaps every weekend if the spider only scrapes on weekdays. We can fill gaps by either

  • extending Friday data ("forward filling" by 2 days), or
  • fill in weekend gaps only if Friday and Monday values match.

Gap-filling functions in interval dict support diverse, practical use-cases.

3. Bidirectional lookup

It is often useful to have associative types that support two-way lookup: both key → value and value → key. For a one-off, make an inverse dictionary (interval_dict.invert()).

Other times, it is more convenient or efficient to use a dictionary that supports two-way lookup directly. Under the hood, the bidirectional interval map stores the data in both direction, so there is no savings in space.

License:

Boost Software License

Dependencies

  • Interval handling uses the Boost Interval Container Library (ICL). IntervalDict thus supports many interval variants: open, close, left-open, right-open, dynamic, discrete (dates) and continuous (floats) underlying types etc.

  • Lazy / synchronous iteration of intervals uses co-routines, provided by generator<T> from cppcoro.
    This requires a compiler that supports the coroutines TS (C++17) or C++20.

Supported Compilers

The code is known to work on the following compilers:

  • Linux + Clang 5.0/6.0 + - Linux + Clang 5.0/6.0 + libc++ Build Status

This reflects the requirements of the cppcoro library

Current Status:

interval dict is under active development.

  1. Support open/close/mixed intervals comprising

    Empty intervals are ignored

  2. Two choices of implementation:

    1. IntervalDictICL of interval_dict uses a Boost ICL interval_map of std::set This was straightforward to implement, and stores the underlying data as disjoint intervals for quick lookups.

      Search operations have complexity. However, for insertion/removal of elements, this data structure can result in polynomial or worse complexity .

      This typically is noticeable if there are large number of intervals (>50,000) for a single key. This happens if the associations are very imbalanced (each key has many tens of thousands of values with overlapping validity intervals).

    2. The second implementation IntervalDictITree of interval_dict uses an Interval Tree. This stores the underlying data as continguous non-overlapping intervals.

      Search operations have complexity. However, insertion and deletions can be much more efficient: .

      Implemented using tinloaf/ygg and boost::intrusive, both red-black trees underneath.

  3. Tests 100% coverage

  4. Bidirectional dictionary. This can be parameterised for different implementation in each direction.

In progress:

  1. #f03c15 Benchmarks
    • For successively overlapping intervals
    • For nested intervals (like a pyramid)
    • Simulate a random proportion of values swapping between different keys at successive intervals
  2. #f03c15 Fuzzing tests for implementation 1 vs 2
  3. Abandon ygg and move to boost::intrusive
  4. Write docs
  5. Doxygen for new implementation
  6. Doxygen html for website
  7. Build and integration tests for website

Not yet started :bowtie:

  1. #f03c15 Tutorial

  2. #f03c15 Cython / Python wrapper

  3. #f03c15 Other performant alternative algorithms from bioinformatics.

Build Status

  • Build Status