Work in progress
sparse_ngrams
is a C++ library that contains a search substring and regexp
algorithms that are scalable for code search indexing and used in GitHub Codesearch. It's indended to
reduce the indexing and query response times compared to zoekt (which is used by Sourcegraph) and Russ Cox's trigram search. The solution is meant to be scalable to billions lines of code with <100ms latency. More on code search project is TBD.
- Easy: First-class, easy to use dependency and carefully documented APIs.
- Fast: We do care about speed of the algorithms and provide reasonable implementations.
- Well tested: We test all algorithms with a unified framework, under sanitizers and fuzzing.
- Benchmarked: We gather benchmarks for all implementations to better understand good and bad spots.
You can use cmake with add_subdirectory
. Includes are in include
,
sources are in src
folders.
We support all C++17 compliant modern compilers (GCC, Clang, MSVC).
To test and benchmark, we use Google benchmark library. Simply do in the root directory:
# Check out the libraries.
$ git clone https://github.com/google/benchmark.git
$ git clone https://github.com/google/googletest.git
$ mkdir build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DSPARSE_NGRAMS_TESTING=on -DBENCHMARK_ENABLE_GTEST_TESTS=off -DBENCHMARK_ENABLE_TESTING=off ..
$ make -j
$ ctest -j4 --output-on-failure
TBD.
The code is made available under the Boost License 1.0.