lucene-c-boost: A C++ repository from romank8k

#About

Optimized implementations of certain [Apache Lucene](http://lucene.apache.org) queries in C++ (via JNI) for anywhere from 0 to 7.8X speedup:

                Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
          AndHighLow      467.01      (0.5%)      294.88      (0.2%)  -36.9% ( -37% -  -36%)
              Fuzzy1       62.97      (4.2%)       62.60      (2.0%)   -0.6% (  -6% -    5%)
              Fuzzy2       25.51      (3.2%)       37.39      (2.0%)   46.6% (  40% -   53%)
          AndHighMed       50.43      (0.3%)      106.58      (0.6%)  111.4% ( 110% -  112%)
             LowTerm      299.44      (0.4%)      684.10      (1.7%)  128.5% ( 125% -  130%)
        OrHighNotLow       45.77      (6.5%)      105.45      (0.1%)  130.4% ( 116% -  146%)
        OrHighNotMed       33.23      (6.1%)       87.89      (0.3%)  164.5% ( 148% -  182%)
       OrHighNotHigh        4.90      (6.6%)       14.08      (0.2%)  187.3% ( 169% -  207%)
            Wildcard       17.11      (0.4%)       51.35      (9.5%)  200.1% ( 189% -  210%)
           OrHighMed       18.19      (6.6%)       58.98      (1.4%)  224.2% ( 202% -  248%)
           OrHighLow       15.32      (6.8%)       50.39      (0.7%)  229.0% ( 207% -  253%)
          OrHighHigh        6.48      (6.4%)       21.36      (0.5%)  229.6% ( 209% -  252%)
             MedTerm       69.58      (1.8%)      241.20      (2.3%)  246.7% ( 238% -  255%)
         AndHighHigh       21.98      (0.8%)       77.74      (1.3%)  253.8% ( 249% -  257%)
       OrNotHighHigh       12.90      (7.0%)       45.97      (0.6%)  256.5% ( 232% -  283%)
        OrNotHighMed       25.82      (7.5%)      114.32      (0.8%)  342.7% ( 310% -  379%)
            HighTerm       22.32      (1.8%)      108.60      (0.7%)  386.5% ( 377% -  396%)
             Prefix3       10.13      (0.6%)       55.27      (1.8%)  445.3% ( 440% -  450%)
        OrNotHighLow       61.18      (7.4%)      346.80      (2.4%)  466.9% ( 425% -  514%)
              IntNRQ        4.98      (0.5%)       38.62      (0.6%)  675.6% ( 671% -  680%)

The gains come from 1) code specialization (creating dedicated code to execute exactly one kind of query, with nearly all abstractions removed), and 2) using C++ instead of Java. It's not yet clear how much of the gains are due to each.

The code is fully decoupled from Lucene: it uses Java's reflection APIs to grab the necessary bits for each query.

This is NOT a port of Apache Lucene to C++! Rather, it implements hardcoded C++ code to optimize certain queries. The optimizations are very narrow: they only apply for specific queries.

This is a spinoff from LUCENE-5049. This project is very new and exploratory at this point. Use at your own risk! (But please provide feedback if you do).

#Usage

It's trivial to use; the only API is a public static method:

NativeSearch.search(searcher, query, topN);

If the provided query matches then the optimized C++ code is used. Otherwise the normal Java implementation is used.

#Installation

If you're using Java 1.7 on Linux, just grab the binaries from dist/*.

Otherewise, run python build.py then put dist/*.so on your dynamic library path, and dist/luceneCBoost-SNAPSHOT.jar on your CLASSPATH.

#Limitations

Requires Lucene 4.3.x
Only tested on Linux / x86 CPU so far
Only sort-by-score is supported
Positional queries, and nested BooleanQuery (i.e., a query other than TermQuery as a clause inside BooleanQuery) and Filters are not optimized
Must use the default 4.3 codec and Similarity
Must use the provided NativeMMapDirectory
This code is all very new and likely to have exciting bugs

#WARNING Whenever native code is used from Java, if there are bugs (likely!), or API mis-use (such as closing a searcher while threads are still searcing against it), then the JVM will likely hit a SEGV and the OS will kill it. You have been warned!

romank8k/lucene-c-boost