[feature request] document counting
In the PrivVG project, document frequencies will be necessary in order to satisfy the differential privacy definition (a single new document might inflate counts a lot if it visits the same edge many times). While `FastLocate` is good enough for initial prototyping, it won't scale well.
I noticed there's support for this in GCSA2 (`SadaCount`/`SadaSparse`); could we have that in GBWT as well? (cc: @ekg)
I'll have to think this through.
Sadakane's document counting structure belongs to a family of data structures that store information in the internal nodes of the suffix tree. While the final structures can be compressed well, the construction algorithms usually store the data explicitly.
The `SadaCount` construction algorithm uses the LCP array to simulate a single-threaded traversal of the suffix tree. It maps each internal node to its inorder rank(s) and stores the information in a single array. This approach works well enough in GCSA2, where the effective text size is ~10^10 for a 1000GP graph. For the GBWT of the same data, with an effective text size over 10^12, we would need something much faster and more space-efficient.
The bottleneck seems to be LCP array construction. Assuming that we can build it efficiently, the rest of the construction should not be a problem. Each GBWT node corresponds to a subtree of the root of the suffix tree, so we can traverse the subtrees independently in parallel and compress the intermediate results.
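To make the shape of that traversal concrete, here is a minimal sketch of the standard stack-based LCP-interval scan (in the style of Abouelhoda et al.). The function and callback are illustrative rather than actual GCSA2 code; a document counting construction would compute its per-node values inside `report` and store them by inorder rank.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch: enumerate the internal nodes of the suffix tree by scanning the
// LCP array once with a stack. lcp[i] is the LCP of the suffixes at SA
// positions i-1 and i (lcp[0] is unused). The callback receives the string
// depth and the leaf range [left, right] of each internal node; the root
// itself stays on the stack and is not reported.
void traverse_internal_nodes(
    const std::vector<uint64_t>& lcp,
    const std::function<void(uint64_t depth, uint64_t left, uint64_t right)>& report)
{
    struct STNode { uint64_t depth, left; };
    std::vector<STNode> stack { { 0, 0 } }; // the root
    uint64_t n = lcp.size();
    for (uint64_t i = 1; i <= n; i++) {
        uint64_t left = i - 1;
        uint64_t depth = (i < n ? lcp[i] : 0); // sentinel 0 flushes the stack
        while (depth < stack.back().depth) {
            STNode node = stack.back(); stack.pop_back();
            report(node.depth, node.left, i - 1); // leaves SA[node.left..i-1]
            left = node.left;
        }
        if (depth > stack.back().depth) { stack.push_back({ depth, left }); }
    }
}
```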
That algorithm is not nearly fast or space-efficient enough. Extrapolating from the results in the paper, LCP array construction for the 1000GP GBWT would require 5-6 days and over 2 TB memory.
The irreducible LCP algorithm (https://doi.org/10.1007/978-3-642-02441-2_17) is more promising. I had an implementation in the RLCSA that required minimal working space on top of the FM-index and the run-length encoded PLCP array. It was reasonably fast with highly repetitive texts, and it should be possible to parallelize it to take advantage of tens of CPU cores.
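For reference, here is a sketch of the Φ-array formulation the irreducible LCP algorithm builds on, using plain in-memory arrays for clarity. The function name and layout are illustrative; the RLCSA implementation instead streams Φ from the index and run-length encodes the PLCP output.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch: PLCP (LCP values in text order) via the Phi array
// (Karkkainen, Manzini & Puglisi 2009). phi[i] is the text position of the
// lexicographic predecessor of suffix i.
std::vector<uint64_t> plcp_via_phi(const std::string& text,
                                   const std::vector<uint64_t>& sa)
{
    uint64_t n = text.size();
    std::vector<uint64_t> phi(n);
    for (uint64_t i = 1; i < n; i++) { phi[sa[i]] = sa[i - 1]; }
    phi[sa[0]] = n; // the lexicographically smallest suffix has no predecessor

    std::vector<uint64_t> plcp(n, 0);
    uint64_t l = 0;
    for (uint64_t i = 0; i < n; i++) { // text order
        uint64_t j = phi[i];
        if (j >= n) { l = 0; continue; }
        // Because PLCP[i] >= PLCP[i-1] - 1, matching can resume at offset l.
        // PLCP[i] is *irreducible* when text[i-1] != text[j-1]; only those
        // values need character comparisons, and their sum is O(n log n).
        while (i + l < n && j + l < n && text[i + l] == text[j + l]) { l++; }
        plcp[i] = l;
        if (l > 0) { l--; }
    }
    return plcp;
}
```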
I believe we would need roughly the following:
- SDSL: Run-length encoded bitvector that can be built incrementally. (We will be dealing with bitvectors with over 10^12 zeros and ones.)
- SDSL / GBWT: PLCP using the RLE bitvector (see the sketch after this list).
- GBWT: PLCP construction from (dynamic?) GBWT using multithreaded irreducible LCP algorithm.
- GBWT: `SadaSparse` (or another variant of Sadakane's document counting structure if it makes more sense).
- GBWT: Multithreaded `SadaSparse` construction using `FastLocate` and the PLCP array.
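The PLCP representation relies on a standard trick: because PLCP[i+1] >= PLCP[i] - 1, the values PLCP[i] + 2i are strictly increasing, so they can be stored as the set bits of a bitvector of length 2n, and each value recovered with one select query. Here is a minimal sketch using `sdsl::sd_vector` as a stand-in for the incremental run-length encoded bitvector proposed above; the class name is illustrative.

```cpp
#include <cstdint>
#include <vector>
#include <sdsl/sd_vector.hpp>
#include <sdsl/util.hpp>

// Sketch: succinct PLCP. PLCP[i] + 2i is strictly increasing and at most
// 2n - 2, so the values fit as set bits in a bitvector of length 2n.
class SparsePLCP {
public:
    explicit SparsePLCP(const std::vector<uint64_t>& plcp)
    {
        uint64_t n = plcp.size();
        sdsl::sd_vector_builder builder(2 * n, n);
        for (uint64_t i = 0; i < n; i++) { builder.set(plcp[i] + 2 * i); }
        this->data = sdsl::sd_vector<>(builder);
        sdsl::util::init_support(this->select, &(this->data));
    }

    // PLCP[i] = (position of the (i+1)th set bit) - 2i.
    uint64_t operator[](uint64_t i) const { return this->select(i + 1) - 2 * i; }

private:
    sdsl::sd_vector<> data;
    sdsl::sd_vector<>::select_1_type select;
};
```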
I assume you are referring to "Sampled Longest Common Prefix Array"?
Beller et al.'s algorithm (queue-based) can be tweaked to use O(#runs) construction space with very similar ideas; please take a look at section 3 here: https://arxiv.org/pdf/2004.01493.pdf (compare their "incremental relation" with lemma 5 in your paper).
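For reference, the core of the base algorithm looks roughly like this. This is a self-contained sketch assuming `bwt` is the BWT of a text terminated by a unique sentinel; naive O(sigma * n) rank tables stand in for a wavelet tree, and the O(#runs) queue representation from the paper is omitted.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Sketch: the queue-based LCP algorithm of Beller et al. The interval of
// each string w is extended backward by every character c occurring in
// BWT[sp..ep]; the first time position ep'+1 is reached, LCP[ep'+1] = |w|.
// The FIFO queue makes this a breadth-first traversal of the suffix tree.
std::vector<uint64_t> beller_lcp(const std::string& bwt)
{
    const uint64_t UNSET = std::numeric_limits<uint64_t>::max();
    uint64_t n = bwt.size();
    if (n == 0) { return { 0 }; }

    // C[c] = number of BWT symbols smaller than c;
    // rank[c][i] = occurrences of c in bwt[0..i).
    std::map<char, uint64_t> C;
    for (char c : bwt) { C[c]++; }
    std::map<char, std::vector<uint64_t>> rank;
    uint64_t total = 0;
    for (auto& entry : C) {
        uint64_t count = entry.second;
        entry.second = total; total += count;
        rank[entry.first].assign(n + 1, 0);
    }
    for (uint64_t i = 0; i < n; i++) {
        for (auto& entry : rank) { entry.second[i + 1] = entry.second[i]; }
        rank[bwt[i]][i + 1]++;
    }

    // lcp[i] for 1 <= i <= n-1 is the LCP of SA[i-1] and SA[i];
    // lcp[0] and lcp[n] are boundary values.
    std::vector<uint64_t> lcp(n + 1, UNSET);
    lcp[0] = 0;
    std::queue<std::tuple<uint64_t, uint64_t, uint64_t>> queue; // (sp, ep, |w|)
    queue.push({ 0, n - 1, 0 }); // the interval of the empty string
    while (!queue.empty()) {
        auto [sp, ep, depth] = queue.front(); queue.pop();
        for (const auto& entry : C) { // extend w to cw for each character c
            char c = entry.first;
            uint64_t new_sp = entry.second + rank[c][sp];
            uint64_t new_ep = entry.second + rank[c][ep + 1];
            if (new_sp >= new_ep) { continue; } // c does not occur in bwt[sp..ep]
            new_ep--; // convert to an inclusive interval
            if (lcp[new_ep + 1] == UNSET) {
                lcp[new_ep + 1] = depth; // first reached at depth |w|
                queue.push({ new_sp, new_ep, depth + 1 });
            }
        }
    }
    return lcp;
}
```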
The algorithms of Nishimoto and Tabei are fundamentally sequential. The one based on the algorithm by Beller et al. corresponds to a breadth-first traversal of the suffix tree, and it outputs the LCP values at suffix tree node boundaries. I don't see a way of parallelizing that over tens of CPU cores.
In general, stringology research is still mostly concerned with texts that are gigabytes or at most tens of gigabytes in size. The GBWT works with terabyte-scale data, and it often has to resort to simple construction algorithms that can be broken down into a large number of independent tasks.
Fair enough. I briefly checked the current lock-free queue landscape, but I don't think they are going to cut it either (the best implementations achieve throughput of around 10M operations/second in balanced enqueue/dequeue scenarios).
Yes, this one. I also came across https://github.com/mpoeter/xenium, though I have only looked at the benchmark results and have no hands-on experience.
Overall I agree that the simplest solution is the best. It would be interesting to check later whether more complicated approaches are more performant, but it's not at all obvious that they will be.