/lucene-c-boost

Implements common Lucene queries in C++ via JNI for sizable (2-3X) speedups in certain cases

Primary language: C++. License: Apache License 2.0.

#About

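Optimized implementations of certain [Apache Lucene](http://lucene.apache.org) queries in C++ (via JNI). In the benchmark below, where "base" is stock Lucene and "comp" is this project, results range from a 37% slowdown (AndHighLow) to a 7.8X speedup (IntNRQ):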

                Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
          AndHighLow      467.01      (0.5%)      294.88      (0.2%)  -36.9% ( -37% -  -36%)
              Fuzzy1       62.97      (4.2%)       62.60      (2.0%)   -0.6% (  -6% -    5%)
              Fuzzy2       25.51      (3.2%)       37.39      (2.0%)   46.6% (  40% -   53%)
          AndHighMed       50.43      (0.3%)      106.58      (0.6%)  111.4% ( 110% -  112%)
             LowTerm      299.44      (0.4%)      684.10      (1.7%)  128.5% ( 125% -  130%)
        OrHighNotLow       45.77      (6.5%)      105.45      (0.1%)  130.4% ( 116% -  146%)
        OrHighNotMed       33.23      (6.1%)       87.89      (0.3%)  164.5% ( 148% -  182%)
       OrHighNotHigh        4.90      (6.6%)       14.08      (0.2%)  187.3% ( 169% -  207%)
            Wildcard       17.11      (0.4%)       51.35      (9.5%)  200.1% ( 189% -  210%)
           OrHighMed       18.19      (6.6%)       58.98      (1.4%)  224.2% ( 202% -  248%)
           OrHighLow       15.32      (6.8%)       50.39      (0.7%)  229.0% ( 207% -  253%)
          OrHighHigh        6.48      (6.4%)       21.36      (0.5%)  229.6% ( 209% -  252%)
             MedTerm       69.58      (1.8%)      241.20      (2.3%)  246.7% ( 238% -  255%)
         AndHighHigh       21.98      (0.8%)       77.74      (1.3%)  253.8% ( 249% -  257%)
       OrNotHighHigh       12.90      (7.0%)       45.97      (0.6%)  256.5% ( 232% -  283%)
        OrNotHighMed       25.82      (7.5%)      114.32      (0.8%)  342.7% ( 310% -  379%)
            HighTerm       22.32      (1.8%)      108.60      (0.7%)  386.5% ( 377% -  396%)
             Prefix3       10.13      (0.6%)       55.27      (1.8%)  445.3% ( 440% -  450%)
        OrNotHighLow       61.18      (7.4%)      346.80      (2.4%)  466.9% ( 425% -  514%)
              IntNRQ        4.98      (0.5%)       38.62      (0.6%)  675.6% ( 671% -  680%)

The gains come from 1) code specialization (creating dedicated code to execute exactly one kind of query, with nearly all abstractions removed), and 2) using C++ instead of Java. It's not yet clear how much of the gains are due to each.

The code is fully decoupled from Lucene: it uses Java's reflection APIs to grab the necessary bits for each query.
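As a rough illustration of what reading internals through reflection looks like, here is a minimal sketch. `PostingsHolder` and its `docFreq` field are hypothetical stand-ins defined in the snippet, not real Lucene classes, and `readPrivateInt` is a made-up helper; the actual fields this project reads are not listed here.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a library class whose internals are private.
// It is NOT a real Lucene class; it only illustrates the technique.
class PostingsHolder {
    private final int docFreq;

    PostingsHolder(int docFreq) {
        this.docFreq = docFreq;
    }
}

class ReflectionSketch {
    // Reads a private int field by name, with no compile-time reference to the field.
    static int readPrivateInt(Object target, String fieldName) {
        try {
            Field f = target.getClass().getDeclaredField(fieldName);
            f.setAccessible(true); // bypass the private modifier
            return f.getInt(target);
        } catch (ReflectiveOperationException e) {
            // The field was renamed, removed, or has a different type.
            throw new IllegalStateException("cannot read field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        PostingsHolder holder = new PostingsHolder(42);
        System.out.println(readPrivateInt(holder, "docFreq"));
    }
}
```

The trade-off of this approach is that a field lookup by name fails only at run time, so a change in the target library's internals surfaces as an exception rather than a compile error.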

This is NOT a port of Apache Lucene to C++! Rather, it provides hand-specialized C++ implementations of certain queries. The optimizations are very narrow: they apply only to specific queries.

This is a spinoff from LUCENE-5049. This project is very new and exploratory at this point. Use at your own risk! (But please provide feedback if you do.)


#Usage

It's trivial to use; the only API is a public static method:

NativeSearch.search(searcher, query, topN);

If the provided query matches one of the optimized cases, the C++ code is used; otherwise the normal Java implementation is used.
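The match-or-fall-back behaviour can be pictured roughly as follows. This is a sketch of the dispatch pattern only, not the project's code: `SketchQuery`, `SketchTermQuery`, `SketchPhraseQuery` and `DispatchSketch` are hypothetical placeholders defined in the snippet, and the rule deciding which queries have a native path is an assumption made for illustration.

```java
// Hypothetical placeholder types; these are NOT Lucene's classes.
interface SketchQuery {}

class SketchTermQuery implements SketchQuery {}

class SketchPhraseQuery implements SketchQuery {}

class DispatchSketch {
    // Assumption for illustration: only term queries have a specialized path.
    static boolean hasNativePath(SketchQuery q) {
        return q instanceof SketchTermQuery;
    }

    // Returns the name of the path that would handle the query.
    // A real implementation would return search hits instead.
    static String search(SketchQuery q) {
        if (hasNativePath(q)) {
            return "native"; // would call into C++ through JNI
        }
        return "java"; // would delegate to the regular Java search
    }

    public static void main(String[] args) {
        System.out.println(search(new SketchTermQuery()));
        System.out.println(search(new SketchPhraseQuery()));
    }
}
```

Because the fallback is chosen per query, callers can route every search through the single entry point without knowing in advance which queries are optimized.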


#Installation

If you're using Java 1.7 on Linux, just grab the binaries from dist/*.

Otherwise, run python build.py, then put dist/*.so on your dynamic library path and dist/luceneCBoost-SNAPSHOT.jar on your CLASSPATH.


#Limitations
  • Requires Lucene 4.3.x
  • Only tested on Linux / x86 CPU so far
  • Only sort-by-score is supported
  • Positional queries, nested BooleanQuery (i.e., a query other than TermQuery as a clause inside BooleanQuery), and Filters are not optimized
  • Must use the default 4.3 codec and Similarity
  • Must use the provided NativeMMapDirectory
  • This code is all very new and likely to have exciting bugs

#WARNING

Whenever native code is used from Java, bugs (likely!) or API misuse (such as closing a searcher while threads are still searching against it) will likely cause the JVM to hit a SEGV, and the OS will kill it. You have been warned!
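One way to avoid closing a searcher while searches are still running is to count in-flight searches and defer the real close until the count reaches zero. The sketch below shows that idea with a hypothetical `GuardedResource` wrapper; it is not part of this project, and the close action is whatever the caller supplies. Lucene's own `SearcherManager` (`acquire` / `release`) addresses the same problem and is probably the better choice in practice.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper, not part of lucene-c-boost: reference-counts a
// resource so the close action runs only after the last user releases it.
class GuardedResource {
    // Starts at 1: the owner holds one reference until it calls release().
    private final AtomicInteger refs = new AtomicInteger(1);
    private final Runnable closeAction;

    GuardedResource(Runnable closeAction) {
        this.closeAction = closeAction;
    }

    // Call before each search; returns false if the resource is already closed.
    boolean tryAcquire() {
        while (true) {
            int n = refs.get();
            if (n <= 0) {
                return false;
            }
            if (refs.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    // Call in a finally block after each search, and once by the owner to close.
    void release() {
        if (refs.decrementAndGet() == 0) {
            closeAction.run(); // no searches are in flight at this point
        }
    }
}
```

A search thread would call `tryAcquire()`, run the search inside `try`, and call `release()` in `finally`; the owner calls `release()` once instead of closing the searcher directly.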