[feature request] document counting
In the PrivVG project, document frequencies will be necessary in order to satisfy the differential privacy definition (a single new document might inflate counts a lot if it visits the same edge many times). While `FastLocate` is good enough for initial prototyping, it won't scale well.
I noticed there's support for this in GCSA2 (`SadaCount`/`SadaSparse`); could we have that in GBWT as well? (cc: @ekg)
I'll have to think this through.
Sadakane's document counting structure belongs to a family of data structures that store information in the internal nodes of the suffix tree. While the final structures can be compressed well, the construction algorithms usually store the data explicitly.
The `SadaCount` construction algorithm uses the LCP array to simulate a single-threaded traversal of the suffix tree. It maps each internal node to its inorder rank(s) and stores the information in a single array. This approach works well enough in GCSA2, where the effective text size is ~10^10 for a 1000GP graph. For the GBWT of the same data, with an effective text size over 10^12, we would need something much faster and more space-efficient.
The bottleneck seems to be LCP array construction. Assuming that we can build it efficiently, the rest of the construction should not be a problem. Each GBWT node corresponds to a subtree of the root of the suffix tree, so we can traverse the subtrees independently in parallel and compress the intermediate results.
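To make the shape of that traversal concrete, here is a minimal sketch of the standard stack-based LCP-interval scan (in the style of Abouelhoda et al.). The function and callback are illustrative rather than actual GCSA2 code; a document counting construction would compute its per-node values inside `report` and store them by inorder rank.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch: enumerate the internal nodes of the suffix tree by scanning the
// LCP array once with a stack. lcp[i] is the LCP of the suffixes at SA
// positions i-1 and i (lcp[0] is unused). The callback receives the string
// depth and the leaf range [left, right] of each internal node; the root
// itself stays on the stack and is not reported.
void traverse_internal_nodes(
    const std::vector<uint64_t>& lcp,
    const std::function<void(uint64_t depth, uint64_t left, uint64_t right)>& report)
{
    struct STNode { uint64_t depth, left; };
    std::vector<STNode> stack { { 0, 0 } }; // the root
    uint64_t n = lcp.size();
    for (uint64_t i = 1; i <= n; i++) {
        uint64_t left = i - 1;
        uint64_t depth = (i < n ? lcp[i] : 0); // sentinel 0 flushes the stack
        while (depth < stack.back().depth) {
            STNode node = stack.back(); stack.pop_back();
            report(node.depth, node.left, i - 1); // leaves SA[node.left..i-1]
            left = node.left;
        }
        if (depth > stack.back().depth) { stack.push_back({ depth, left }); }
    }
}
```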
That algorithm is not nearly fast or space-efficient enough. Extrapolating from the results in the paper, LCP array construction for the 1000GP GBWT would require 5-6 days and over 2 TB memory.
The irreducible LCP algorithm (https://doi.org/10.1007/978-3-642-02441-2_17) is more promising. I had an implementation in the RLCSA that required minimal working space on top of the FM-index and the run-length encoded PLCP array. It was reasonably fast with highly repetitive texts, and it should be possible to parallelize it to take advantage of tens of CPU cores.
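For reference, here is a sketch of the Φ-array formulation the irreducible LCP algorithm builds on, using plain in-memory arrays for clarity. The function name and layout are illustrative; the RLCSA implementation instead streams Φ from the index and run-length encodes the PLCP output.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch: PLCP (LCP values in text order) via the Phi array
// (Karkkainen, Manzini & Puglisi 2009). phi[i] is the text position of the
// lexicographic predecessor of suffix i.
std::vector<uint64_t> plcp_via_phi(const std::string& text,
                                   const std::vector<uint64_t>& sa)
{
    uint64_t n = text.size();
    std::vector<uint64_t> phi(n);
    for (uint64_t i = 1; i < n; i++) { phi[sa[i]] = sa[i - 1]; }
    phi[sa[0]] = n; // the lexicographically smallest suffix has no predecessor

    std::vector<uint64_t> plcp(n, 0);
    uint64_t l = 0;
    for (uint64_t i = 0; i < n; i++) { // text order
        uint64_t j = phi[i];
        if (j >= n) { l = 0; continue; }
        // Because PLCP[i] >= PLCP[i-1] - 1, matching can resume at offset l.
        // PLCP[i] is *irreducible* when text[i-1] != text[j-1]; only those
        // values need character comparisons, and their sum is O(n log n).
        while (i + l < n && j + l < n && text[i + l] == text[j + l]) { l++; }
        plcp[i] = l;
        if (l > 0) { l--; }
    }
    return plcp;
}
```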
I believe we would need roughly the following:
- SDSL: Run-length encoded bitvector that can be built incrementally. (We will be dealing with bitvectors with over 10^12 zeros and ones.)
- SDSL / GBWT: PLCP using the RLE bitvector (see the sketch after this list).
- GBWT: PLCP construction from (dynamic?) GBWT using multithreaded irreducible LCP algorithm.
- GBWT: `SadaSparse` (or another variant of Sadakane's document counting structure if it makes more sense).
- GBWT: Multithreaded `SadaSparse` construction using `FastLocate` and the PLCP array.
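The PLCP representation relies on a standard trick: because PLCP[i+1] >= PLCP[i] - 1, the values PLCP[i] + 2i are strictly increasing, so they can be stored as the set bits of a bitvector of length 2n, and each value recovered with one select query. Here is a minimal sketch using `sdsl::sd_vector` as a stand-in for the incremental run-length encoded bitvector proposed above; the class name is illustrative.

```cpp
#include <cstdint>
#include <vector>
#include <sdsl/sd_vector.hpp>
#include <sdsl/util.hpp>

// Sketch: succinct PLCP. PLCP[i] + 2i is strictly increasing and at most
// 2n - 2, so the values fit as set bits in a bitvector of length 2n.
class SparsePLCP {
public:
    explicit SparsePLCP(const std::vector<uint64_t>& plcp)
    {
        uint64_t n = plcp.size();
        sdsl::sd_vector_builder builder(2 * n, n);
        for (uint64_t i = 0; i < n; i++) { builder.set(plcp[i] + 2 * i); }
        this->data = sdsl::sd_vector<>(builder);
        sdsl::util::init_support(this->select, &(this->data));
    }

    // PLCP[i] = (position of the (i+1)th set bit) - 2i.
    uint64_t operator[](uint64_t i) const { return this->select(i + 1) - 2 * i; }

private:
    sdsl::sd_vector<> data;
    sdsl::sd_vector<>::select_1_type select;
};
```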
I assume you are referring to "Sampled Longest Common Prefix Array"?
Beller et al.'s algorithm (queue-based) can be tweaked to use O(#runs) construction space with very similar ideas; please take a look at section 3 here: https://arxiv.org/pdf/2004.01493.pdf (compare their "incremental relation" with lemma 5 in your paper).
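For reference, the core of the base algorithm looks roughly like this. This is a self-contained sketch assuming `bwt` is the BWT of a text terminated by a unique sentinel; naive O(sigma * n) rank tables stand in for a wavelet tree, and the O(#runs) queue representation from the paper is omitted.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Sketch: the queue-based LCP algorithm of Beller et al. The interval of
// each string w is extended backward by every character c occurring in
// BWT[sp..ep]; the first time position ep'+1 is reached, LCP[ep'+1] = |w|.
// The FIFO queue makes this a breadth-first traversal of the suffix tree.
std::vector<uint64_t> beller_lcp(const std::string& bwt)
{
    const uint64_t UNSET = std::numeric_limits<uint64_t>::max();
    uint64_t n = bwt.size();
    if (n == 0) { return { 0 }; }

    // C[c] = number of BWT symbols smaller than c;
    // rank[c][i] = occurrences of c in bwt[0..i).
    std::map<char, uint64_t> C;
    for (char c : bwt) { C[c]++; }
    std::map<char, std::vector<uint64_t>> rank;
    uint64_t total = 0;
    for (auto& entry : C) {
        uint64_t count = entry.second;
        entry.second = total; total += count;
        rank[entry.first].assign(n + 1, 0);
    }
    for (uint64_t i = 0; i < n; i++) {
        for (auto& entry : rank) { entry.second[i + 1] = entry.second[i]; }
        rank[bwt[i]][i + 1]++;
    }

    // lcp[i] for 1 <= i <= n-1 is the LCP of SA[i-1] and SA[i];
    // lcp[0] and lcp[n] are boundary values.
    std::vector<uint64_t> lcp(n + 1, UNSET);
    lcp[0] = 0;
    std::queue<std::tuple<uint64_t, uint64_t, uint64_t>> queue; // (sp, ep, |w|)
    queue.push({ 0, n - 1, 0 }); // the interval of the empty string
    while (!queue.empty()) {
        auto [sp, ep, depth] = queue.front(); queue.pop();
        for (const auto& entry : C) { // extend w to cw for each character c
            char c = entry.first;
            uint64_t new_sp = entry.second + rank[c][sp];
            uint64_t new_ep = entry.second + rank[c][ep + 1];
            if (new_sp >= new_ep) { continue; } // c does not occur in bwt[sp..ep]
            new_ep--; // convert to an inclusive interval
            if (lcp[new_ep + 1] == UNSET) {
                lcp[new_ep + 1] = depth; // first reached at depth |w|
                queue.push({ new_sp, new_ep, depth + 1 });
            }
        }
    }
    return lcp;
}
```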
The algorithms of Nishimoto and Tabei are fundamentally sequential. The one based on the algorithm by Beller et al. corresponds to a breadth-first traversal of the suffix tree, and it outputs the LCP values at suffix tree node boundaries. I don't see a way of parallelizing that over tens of CPU cores.
In general, stringology research is still mostly concerned with texts that are gigabytes or at most tens of gigabytes in size. The GBWT works with terabyte-scale data, and it often has to resort to simple construction algorithms that can be broken down into a large number of independent tasks.
Fair enough. I briefly checked the current lock-free queue landscape, but I don't think they are going to cut it either (the best implementations achieve throughput of around 10M operations/second in balanced enqueue/dequeue scenarios).
Yes, this one. I also came across https://github.com/mpoeter/xenium, though I have only looked at the benchmark results and have no hands-on experience.
Overall I agree that the simplest solution is the best. It would be interesting to check later whether more complicated approaches are more performant, but it's not at all obvious that they will be.