lemire/FastPFor

Binary Packing

Closed this issue · 23 comments

Without looking at the implementations, it is hard to tell what the difference is between bitpacking, bitpackingaligned, bitpackingunaligned, and blockpacking. Also, are there any references to specific papers for those?

These are technical implementations of bitpacking, to be used as part of binary packing. I am not sure that they matter particularly and I doubt research papers would cover such unexciting technical details.

All of this internal terminology is mostly undocumented, as you point out. We should either simplify it or embark on a documentation effort. Pull requests invited.

I am going to close this as I doubt that there are deep answers to be found.

If you have specific questions, please raise them...

Thanks for the reply. Is there an implementation of the standard Binary Packing introduced in

Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words. Softw. Pract. Exper. 40, 2 (February 2010), 131-147. DOI=http://dx.doi.org/10.1002/spe.v40:2

What is not clear to me regarding the implementation is the following.

  • The original PackedBinary seems to be ByteAlignedPacking with the align parameter set to false at https://github.com/lemire/FastPFor/blob/master/headers/blockpacking.h#L230
  • It uses fastunalignedpackwithoutmask_* to write the actual payload
  • fastunalignedpackwithoutmask_16 (as an example) is supposed to write 16 values with the same bit length and calls the right function based on the bit length used
  • in the case of __fastunalignedpackwithoutmask3_16, whose signature begins
    byte *__fastunalignedpackwithoutmask3_16(const uint32_t *__restrict__ in,
    it will write, for example, 16 elements using 3 bits each: 48 bits written into two 32-bit words. At the end it returns an 8-bit-aligned pointer at the end of the second word just written.

This feels to me like it is doing some alignment every 16 elements, wasting 16 bits in the previous example. Is what I am saying correct?
If so, I can send a PR to fix this :)
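For concreteness, here is a rough sketch of the arithmetic I have in mind (my own illustration, not the library's code), assuming the packed output is rounded up to the next 32-bit word after every group of 16 values:

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Wasted bits per group of 16 values if the output is padded to a
        // whole number of 32-bit words (assumption, see text above).
        const uint32_t group = 16;
        for (uint32_t b = 1; b <= 32; ++b) {
            uint32_t payload = group * b;                  // useful bits
            uint32_t written = ((payload + 31) / 32) * 32; // rounded up to whole words
            std::printf("b=%2u: %3u useful bits, %3u written, %2u wasted\n",
                        b, payload, written, written - payload);
        }
        return 0;
    }

For b = 3 this reports 48 useful bits, 64 written, and 16 wasted, which is the case above.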

Also, correct me if I am wrong, but I feel this is very close to the description of PackedBinary in the paper I cited above, even though it is not exactly the same thing.
The whole HowManyMiniBlocks and MiniBlockSize business is never mentioned in the original paper. It is a cool idea, but I feel a fair implementation should be totally unaligned and respect the single-header idea in the paper. What do you think?

@amallia

Is there an implementation of the standard Binary Packing introduced in (...)

The reference for the implementation in the current C++ library is...

Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software Practice & Experience 45 (1), 2015. https://arxiv.org/abs/1209.2137

It cites Anh and Moffat (2010). To my knowledge, Anh and Moffat have not made available the details of their implementation, so it is not possible for me to tell exactly what they did. The broad idea is the same.

This feels to me like it is doing some alignment every 16 elements, wasting 16 bits in the previous example. Is what I am saying correct? (...) If so, I can send a PR to fix this :)

This seems correct, yes. Pull requests are always invited.

I feel a fair implementation should be totally unaligned and respect the single-header idea in the paper. What do you think?

I am not quite sure what you mean by "fair". We describe binary packing in section 2.6 and our implementation in section 6.2 of our 2015 paper (https://arxiv.org/pdf/1209.2137.pdf).

Hence, to keep blocks aligned on 32-bit boundaries, we group the blocks and respective descriptors into meta-blocks each of which contains 4 successive blocks. A meta-block is preceded by a 32-bit descriptor that combines 4 bit widths b (8 bits per width). We call this scheme BP32. We also experimented with versions of binary packing on fewer integers (8 integers and 16 integers). Because these versions are slower, we omit them from our experiments.
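As a rough illustration of the BP32 layout described in that passage, here is a scalar sketch (not the library's actual code; the bit order within a packed block and the byte order within the descriptor are assumptions on my part):

    #include <cstdint>

    // Scalar stand-in for a bit-unpacking routine: reads 32 values of width b
    // (0 <= b <= 32) from a stream of packed 32-bit words, writes them to out,
    // and returns the advanced input pointer. The library uses specialized,
    // unrolled (and vectorized) routines instead.
    static const uint32_t *unpack32_scalar(const uint32_t *in, uint32_t *out, uint32_t b) {
        if (b == 0) { for (int i = 0; i < 32; ++i) out[i] = 0; return in; }
        const uint64_t mask = (b == 32) ? 0xFFFFFFFFu : ((1u << b) - 1);
        uint64_t buffer = 0;
        int available = 0;
        for (int i = 0; i < 32; ++i) {
            if (available < (int)b) {           // refill from the next packed word
                buffer |= (uint64_t)(*in++) << available;
                available += 32;
            }
            out[i] = (uint32_t)(buffer & mask); // low bits first (assumed order)
            buffer >>= b;
            available -= (int)b;
        }
        return in;
    }

    // One BP32 meta-block: a 32-bit descriptor holding four 8-bit bit widths,
    // followed by four packed blocks of 32 integers each (128 integers total).
    const uint32_t *decode_metablock(const uint32_t *in, uint32_t *out) {
        const uint32_t descriptor = *in++;
        for (int block = 0; block < 4; ++block) {
            const uint32_t b = (descriptor >> (24 - 8 * block)) & 0xFFu;
            in = unpack32_scalar(in, out, b);
            out += 32;
        }
        return in;
    }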

This C++ library (FastPFor) does not take Anh and Moffat (2010) as the reference for implementations. It is the library we built to support Lemire and Boytsov (2015). Though we credit Anh and Moffat throughout, we make no claim that our implementations are a match for what Anh and Moffat used. Rather Anh and Moffat provided ideas and principles. We built our own implementations and designs.

Understood! Thanks a lot for all the clarifications.

My recollection is that @lemire noticed how fast bitpacking was in one of the prior papers, but the authors of that paper apparently didn't pay much attention to this fact and didn't explore the direction any further. The idea of binary packing was certainly used by other folks (we did cite them), but we pushed it a bit further with a different layout, fast integrated decoding, and a more careful/accurate evaluation (we test both decoding and prefix-sum computation, unlike some work that ignores the cost of the prefix-sum computation). The overall result is that the decoding speed is roughly the same as the speed of memcpy.

Thanks @searchivarius for the supporting evidence.

One important point, @amallia, is the importance of what we call "integrated differential coding". Most prior work makes little of differential coding as a computational task. We found and reported that integrating differential coding within the compression algorithm improved matters considerably. We have a follow-up paper that I need to plug...

Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016 https://arxiv.org/abs/1401.6399
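To illustrate the idea with a scalar sketch (not the library's vectorized code): instead of decoding deltas and then running a separate prefix-sum pass over the output, the prefix sum is computed inside the decoding loop, so the data is touched only once.

    #include <cstddef>
    #include <cstdint>

    // Two-pass approach: decode the deltas, then a separate prefix-sum pass.
    void decode_then_prefix_sum(const uint32_t *deltas, uint32_t *out, std::size_t n,
                                uint32_t initial) {
        for (std::size_t i = 0; i < n; ++i) out[i] = deltas[i]; // stands in for bit-unpacking
        uint32_t running = initial;
        for (std::size_t i = 0; i < n; ++i) {                   // second pass over the data
            running += out[i];
            out[i] = running;
        }
    }

    // Integrated approach: the prefix sum is fused into the decoding loop.
    void decode_integrated(const uint32_t *deltas, uint32_t *out, std::size_t n,
                           uint32_t initial) {
        uint32_t running = initial;
        for (std::size_t i = 0; i < n; ++i) {
            running += deltas[i];  // delta decoding happens during decompression
            out[i] = running;
        }
    }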

I would add, so that there is no misunderstanding, that I am a big fan of Anh and Moffat (2010). There is a reason why I have been citing this paper in code and papers for years.

Yes @searchivarius, I agree that bit/binary packing is quite fast :)
In particular for linear decoding; it might be true for skipping too, but with a different layout. What is your email address? We can talk about it if there is interest.

Anyway, I initially wanted to understand whether the paper from Anh and Moffat had an implementation in this library. They use a different layout, which might produce different decoding-speed numbers. I might implement it.

At this point, my question is: if you group, say, 16 headers together, why not just group all the headers at the very beginning, followed by the fixed-length encoded elements?

@amallia you would need to refer back to the original header, I guess. This may cost you (but I haven't tried): would it still be in the cache when you keep decoding memory pieces that are farther and farther away?

Yeah, if you have any questions, feel free to send an e-mail to leo at boytsov.info.

why not just group all the headers at the very beginning, followed by the fixed-length encoded elements?

Sure. My belief is that we tested out this possibility at one point (Nathan Kurz was concerned about these layout issues) and found that it was not obviously beneficial. This could be revisited, naturally.

@searchivarius I mean, right now you are grouping 16 mini-headers. So if you need to access a specific block, you probably incur 2 cache misses: the first one for the header, the second when you jump to the right block. Whereas if you store everything in a single initial header, your first cache miss will bring in many more mini-headers, and then you only pay another cache miss for each block you want to decode.

On the other hand, it turns out that if you use this encoding for a block-based inverted index with 128-element blocks, then 128/16 is exactly 8, so it might not make any difference in practice :)

Final question. In "Decoding billions of integers per second through vectorization" you say:

Three factors determine the storage cost of a given block in binary packing:

  • the number of bits (b) required to store the largest integer value in the binary notation,
  • the block length (B),
  • and a fixed per-block overhead (κ).

What is κ in the FastPFor implementation? With the multi-header optimization, κ should always be equal to 0, from my understanding.

So if you need to access a specific block, you probably incur 2 cache misses.

If you are interested in random access into compressed data, you might enjoy the following paper...

Daniel Lemire, Christoph Rupp, Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions, Information Systems 66, 2017 https://arxiv.org/abs/1611.05428

What is κ in the FastPFor implementation? With the multi-header optimization, κ should always be equal to 0, from my understanding.

You need to store the bit width. This uses 8 bits in most of our implementations.
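In other words, κ amounts to the amortized cost of the per-block width descriptor rather than zero. A back-of-the-envelope sketch (assuming the 8-bit width mentioned above and B = 32, as in BP32):

    #include <cstdio>

    int main() {
        // Cost per integer: b payload bits plus kappa/B bits of amortized header.
        const double B = 32.0;     // block length (BP32)
        const double kappa = 8.0;  // 8-bit per-block width descriptor (assumption above)
        for (int b = 1; b <= 8; ++b)
            std::printf("b=%d -> %.2f bits per integer\n", b, b + kappa / B);
        return 0;
    }

So the overhead is about 0.25 bits per integer for B = 32, and smaller for larger blocks.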

@amallia when you access data sequentially prefetching may reduce the cost of the second cache miss IMHO.

Unless you have an auxiliary data structure, or the bit width "b" is fixed, then you cannot randomly access a specific block. If the bit width is fixed, the whole discussion is moot. If you have an auxiliary data structure, then it is a whole other story.

My understanding is that we have
|h1 h2 h3 h4 h5 h6|p1 p2 p3 p4 p5 p6|
where h1 is the header of p1 containing b (the bit length used to encode p1), and p1 is the encoding of 16 elements using b bits each.

Once you read |h1 h2 h3 h4 h5 h6|, you can easily skip to p6 since you know the number of elements and the bit lengths. This specific skip will very likely cost 2 cache misses.
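To spell out the skip computation I have in mind (a hypothetical sketch: one bit width per 16-value block, payload packed contiguously after the headers):

    #include <cstddef>
    #include <cstdint>

    // widths[i] is the bit width stored in header h(i+1); each block packs 16
    // values. Returns the bit offset of block `target` within the payload.
    std::size_t payload_bit_offset(const uint8_t *widths, std::size_t target) {
        std::size_t offset = 0;
        for (std::size_t i = 0; i < target; ++i)
            offset += 16u * widths[i];  // each earlier block occupies 16 * b_i bits
        return offset;
    }

Reading the headers gives everything needed to compute this offset; the two likely cache misses are one for the headers and one for the target block itself.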

@amallia when you access data sequentially prefetching may reduce the cost of the second cache miss IMHO.
Not sure this is always true. It really depends on how far you want to skip. What you are saying might be true for 1-2 cache lines, right?

@amallia ok, a test could tell, but my point is that binary packing for inverted files is actually super fast. Everybody is working on this problem and it has been a bit over-researched. However, if you are interested in compression, I have a much less researched compression problem with potentially a much bigger impact. Actually, it is even two problems. :-)

I sent you an email.

Once you read |h1 h2 h3 h4 h5 h6|, you can easily skip to p6 since you know the number of elements and the bit lengths. This specific skip will very likely cost 2 cache misses.

Interesting.

There are two scenarios that I have in mind.

If you are not using differential coding, then you are correct.

But if you are using differential coding, then you need to know the prefix sum of all the values up to that point before you can skip ahead. So you need some kind of skipping data structure.
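To make that concrete, here is a minimal sketch (my own illustration, not something this library provides) of the kind of auxiliary information such a skipping structure would need to store when differential coding is used:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // For each block of deltas, record the prefix sum of everything before it,
    // i.e. the value needed to seed delta decoding at that block.
    std::vector<uint32_t> build_block_bases(const std::vector<uint32_t> &deltas,
                                            std::size_t block_size) {
        std::vector<uint32_t> bases;
        uint32_t running = 0;
        for (std::size_t i = 0; i < deltas.size(); ++i) {
            if (i % block_size == 0)
                bases.push_back(running); // base value for this block
            running += deltas[i];
        }
        return bases;
    }

Without such a table, recovering the value at position j requires summing every delta up to j, even if the bit widths alone let you locate the right block.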