Search index algorithm for GitHub code search

sparse_ngrams: GitHub code search indexing

Work in progress

sparse_ngrams is a C++ library of substring and regexp search algorithms that scale to code search indexing and are used in GitHub Codesearch. It is intended to reduce indexing and query response times compared to zoekt (which is used by Sourcegraph) and Russ Cox's trigram search. The solution is meant to scale to billions of lines of code with <100 ms latency. More on the code search project is TBD.

  • Easy: A first-class, easy-to-use dependency with carefully documented APIs.
  • Fast: We care about the speed of the algorithms and provide reasonable implementations.
  • Well tested: We test all algorithms with a unified framework, under sanitizers and fuzzing.
  • Benchmarked: We gather benchmarks for all implementations to better understand where they perform well and where they do not.

Quick Start

You can add the library to a CMake project with add_subdirectory. Headers are in the include folder and sources are in the src folder.

We support all modern C++17-compliant compilers (GCC, Clang, MSVC).
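
As a rough illustration of the add_subdirectory approach, a consuming project's CMakeLists.txt might look like the sketch below. The checkout path third_party/sparse_ngrams, the consumer name my_search_tool, and the library target name sparse_ngrams are all assumptions made for this example; check the project's own CMakeLists.txt for the actual target name.

```cmake
cmake_minimum_required(VERSION 3.10)
project(my_search_tool CXX)  # hypothetical consumer project

# The library requires a C++17-compliant compiler.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Assumes the repository is checked out (for example as a git submodule)
# at third_party/sparse_ngrams; the path is illustrative.
add_subdirectory(third_party/sparse_ngrams)

add_executable(my_search_tool main.cpp)

# Assumption: the library target is named `sparse_ngrams`.
# Verify the real target name in the library's CMakeLists.txt.
target_link_libraries(my_search_tool PRIVATE sparse_ngrams)
```

If the library target does not export its include directory, adding target_include_directories(my_search_tool PRIVATE third_party/sparse_ngrams/include) should make the headers visible.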

Testing

For testing and benchmarking, we use the GoogleTest and Google Benchmark libraries. Run the following in the root directory:

# Check out the libraries.
$ git clone https://github.com/google/benchmark.git
$ git clone https://github.com/google/googletest.git
$ mkdir build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DSPARSE_NGRAMS_TESTING=on -DBENCHMARK_ENABLE_GTEST_TESTS=off -DBENCHMARK_ENABLE_TESTING=off ..
$ make -j
$ ctest -j4 --output-on-failure

Documentation

TBD.

License

The code is made available under the Boost Software License 1.0.