/intset

Fast, persistent, succinct integer sets.

Primary language: Haskell. License: BSD-3-Clause (BSD 3-Clause "New" or "Revised" License).

Synopsis

This package provides efficient integer interval sets.

Description

Persistent... is it trees?

Yes, radix trees. Trees are balanced by prefix bits, so we have fast merge operations such as union, intersection and difference. Chris Okasaki and Andrew Gill show that Patricia-tree-based integer maps might be an order of magnitude faster than their red-black tree counterparts on these operations. The same applies to integer sets: we just have keys, but no values.

So what does "dense" mean?

That means we keep suffixes in bitmaps, so we might pack, say, 10 integers that lie close together into one bitmap. This optimization doesn't affect execution times for sparse sets, but makes dense sets much more memory efficient: roughly 10-50 times less space usage, depending on the machine word size and the actual density of the set. Basically, this leaves us 3-4 times less memory efficient than arrays of tightly packed bits, but see...
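
The prefix/suffix split described above can be sketched as follows. This is a minimal illustration, not the package's code: it assumes 64-bit bitmaps (so the low 6 bits of a key form the suffix), and the names `prefixOf`, `suffixOf`, `packLeaf` and `memberLeaf` are hypothetical.

```haskell
import Data.Bits (complement, setBit, testBit, (.&.))
import Data.Word (Word64)

-- Assumption: one bitmap is a 64-bit word, so the suffix is the low 6 bits.
suffixMask :: Int
suffixMask = 63

-- | High bits of the key, shared by every element stored in one bitmap.
prefixOf :: Int -> Int
prefixOf x = x .&. complement suffixMask

-- | Low bits of the key: the position of the element inside the bitmap.
suffixOf :: Int -> Int
suffixOf x = x .&. suffixMask

-- | Pack the keys that share the given prefix into a single machine word.
-- Keys with a different prefix are ignored in this sketch.
packLeaf :: Int -> [Int] -> Word64
packLeaf prefix xs =
  foldl setBit 0 [suffixOf x | x <- xs, prefixOf x == prefix]

-- | Membership test against one (prefix, bitmap) leaf.
memberLeaf :: Int -> (Int, Word64) -> Bool
memberLeaf x (prefix, bits) =
  prefixOf x == prefix && testBit bits (suffixOf x)
```

With these definitions, `packLeaf 64 [64, 65, 70, 127]` stores four keys in one word, which is where the memory savings for dense sets come from.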

How is suffix compaction performed?

There is a pretty simple algorithm used in memory allocators, called the "buddy memory allocator". In a nutshell, we have a big block which is split in half when we remove an element from one of the halves, and the halves are merged back when we insert. It's somewhat the inverse of the ordinary tree approach: in a buddy tree we hold more information about the elements that it doesn't contain, while in a prefix tree we hold more information about the elements that it does contain. It's easy to guess what we should do with it: take the two structures and fuse them into one to produce a new structure which performs better.

Indeed, the key idea of the design is right here: we switch back and forth between representations on a per-subtree basis. We intersperse different representations in different tree branches. It's like a chameleon:

  • If some subset is sparse, we just keep a radix tree with bitmaps at the leaves.

  • If some subset becomes full, we turn it into a block. If its buddy block appears, we join the two buddy blocks into one. And so forth.
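
The split-and-merge behaviour can be illustrated with a deliberately simplified model. The sketch below is hypothetical and is not the package's data type: it uses a plain binary trie over the universe [0, 2^depth), leaves out the bitmap leaves and prefix compression, and the names `Tree`, `node` and `halves` are made up for the example.

```haskell
import Data.Bits (testBit)

-- | A set over [0, 2^depth), modelled as a binary trie (hypothetical model).
data Tree
  = Empty            -- no element of this block is present
  | Full             -- every element of this block is present
  | Node Tree Tree   -- mixed block: halves for bit 0 and bit 1
  deriving (Eq, Show)

-- | Smart constructor: two full buddies merge into one full block,
-- two empty halves collapse into one empty block.
node :: Tree -> Tree -> Tree
node Full  Full  = Full
node Empty Empty = Empty
node l     r     = Node l r

-- | View any block as two half-sized buddies.
halves :: Tree -> (Tree, Tree)
halves Empty      = (Empty, Empty)
halves Full       = (Full, Full)
halves (Node l r) = (l, r)

-- | Insert a key; 'd' is the number of key bits still to inspect.
insert :: Int -> Int -> Tree -> Tree
insert 0 _ _    = Full
insert _ _ Full = Full
insert d x t    =
  let (l, r) = halves t
      b      = d - 1
  in if testBit x b then node l (insert b x r) else node (insert b x l) r

-- | Delete a key; a full block is split in half on the way down.
delete :: Int -> Int -> Tree -> Tree
delete 0 _ _     = Empty
delete _ _ Empty = Empty
delete d x t     =
  let (l, r) = halves t
      b      = d - 1
  in if testBit x b then node l (delete b x r) else node (delete b x l) r

-- | Membership test, following the key bits from the top.
member :: Int -> Int -> Tree -> Bool
member _ _ Empty      = False
member _ _ Full       = True
member d x (Node l r) = member (d - 1) x (if testBit x (d - 1) then r else l)

-- | Build a set of the given depth from a list of keys.
fromList :: Int -> [Int] -> Tree
fromList d = foldr (insert d) Empty
```

In this model `fromList 3 [0 .. 7]` collapses to the single constructor `Full`, and deleting one key splits only the blocks on the path to that key, leaving the other buddies intact.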

That is, we get a structure that dynamically chooses the optimal representation depending on the density of the set. Moreover, in the best case this leads to huge space savings:

> ppStats (fromList [0..123456])

gives:

Bin count: 6
Tip count: 1
Fin count: 6
Size in bytes: 408
Saved space over dense set:  123072
Saved space over bytestring: 11879
Percent saved over dense set:  99.6695821185617%
Percent saved over bytestring: 96.67941727028567%

ppStats is not an exposed function, but you can play with it using cabal-dev ghci.

I don't know if it is an old idea, but this works just fine.

So when is this data structure a good choice?

In many situations. It might be used as a persistent and compact replacement for Bool arrays or Data.IntSet, with the following advantages:

  • Purity is extremely useful in multithreaded settings: we can keep a set in a mutable transactional variable or an IORef and update it atomically. So it could be used as a replacement for TArray Int Bool as well.
  • By merging intervals together we achieve compactness. In the best case some of the main operations take O(1) time and space, so if you need an interval set, it's here.
  • Fast serialization, if you need conversion to/from bytestrings. Because of the bitmaps, this conversion can be done extremely fast.
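
The first point can be sketched as below. Since this package's exported API is not shown here, the example uses `Data.IntSet` from containers as a stand-in; it assumes the same pattern carries over to any persistent set with `insert` and `member`. The helper name `insertNew` is hypothetical.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import qualified Data.IntSet as IntSet

-- | Atomically insert a key and report whether it was new.
-- The set is immutable, so the update is a single atomic swap of the
-- reference and readers never observe a half-updated set.
insertNew :: IORef IntSet.IntSet -> Int -> IO Bool
insertNew ref x = atomicModifyIORef' ref $ \s ->
  (IntSet.insert x s, not (IntSet.member x s))
```

Several threads can call `insertNew` on the same reference without extra locking; for updates that span more than one variable, a `TVar` with STM would be the usual choice instead.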

How does this implementation relate to the containers version?

It is heavily based on it. Essentially we just add the buddy interval compaction, but it turns out that some operations become more complicated and require much more effort to implement: in order to maintain all the tree invariants we need to take more cases into account. This is the reason why some operations are not implemented yet (e.g. views are missing), but I hope to fix that in time.

Documentation

See the Haddock-generated documentation.

Maintainer

This library is written and maintained by Sam T. pxqr.sta@gmail.com

Feel free to report bugs and suggestions via the GitHub issue tracker or by mail.