position-heap is an implementation of position heaps which is inspired by Ross McConnell’s implementation: http://www.cs.colostate.edu/PositionHeaps/. A position heap is a simple and dynamic text indexing data structure.
Andrzej Ehrenfeucht, Ross M. McConnell, Nissa Osheim, Sung-Whan Woo, Position heaps: A simple and dynamic text indexing data structure, J. Discrete Algorithms 9(1): 100-121 (2011).
This project solves the problem of finding all occurrences of a pattern string P in a text T. It will preprocess T (by building a position heap) to speed up successive searches.
Let n denote the length of T, m denote the length of S and k denote the number of occurrences.
The code implements two construction algorithms as follows:
- naiveBuild means a simple algorithm to construct the position heap with O(n^2) worst-case time bound (O(n log n) expected time for a randomly generated text).
- fastBuild means an algorithm to construct the position heap with Θ(n) time bound.
And two search algorithms:
- naiveSearch means a simple search algorithm with O(m^2+k) worst-case time bound. This bound is pessimistic for it takes O(m+k) expected time when either the text or the pattern is randomly generated.
- fastSearch means a construction algorithm with Θ(m+k) time bound. This will do some extra work when constructing the position heap.
Note: The alphabet-size Σ factor is omitted in the time bounds. This project is just a simple implementation, so it uses arrays to simplify codes. You may use balanced binary search trees, hash tables or some other data structures to replace arrays.
For either step (construction and search), you have two choices, so you can combine them freely to get four different programs:
- fastBuildFastSearch
- fastBuildNaiveSearch
- naiveBuildFastSearch
- naiveBuildNaiveSearch
Their meanings are self-explaining.
Each program receives two strings seperated by whitespace representing T and S respectively, and prints out all occurrences of S in T. The leftmost position of S is labeled 0. These positions may not be ascending.
% make fastBuildFastSearch
g++ -g -D LINEAR_BUILD -D LINEAR_SEARCH main.cpp -o fastBuildFastSearch
% ./fastBuildFastSearch
aabcabc
ab
4
1
4 1 is returned by fastBuildFastSearch which means the pattern ab have two occurrences at position 4 and 1 in the text aabcabc.