
Lucene / FoundationDB integration library

Primary language: Java. License: Apache License 2.0 (Apache-2.0).

FDBLucene

FDBLucene is a new project to store Lucene indexes in FoundationDB while providing high performance for both indexing and searching.

Build

Requires Apache Maven to build:

https://maven.apache.org/

Run mvn clean install -DskipTests

Tests

Requires a local FoundationDB cluster (https://www.foundationdb.org/) to be installed and running.

Run mvn test to run the unit tests included in src/test.

Approaches

This repository contains two different approaches to storing Lucene indexes in FoundationDB. At this time the FDBDirectory approach is the active candidate.

FDBIndexWriter / FDBIndexReader

These classes implement a subset of Lucene's features.

The principal advantage of this approach over FDBDirectory is that it removes the need for an exclusive writer. Multiple instances of FDBIndexWriter can safely add, update or delete documents from the same index concurrently.

DATA.md describes the format of the keys and values used to build the index and serve requests.

Deviations from Lucene

This approach no longer uses:

  1. the IndexWriter class
  2. the notion of a Directory
  3. the notion of a Codec
  4. Field numbers

FDBIndex{Reader,Writer} implement only a subset of Lucene's features, though more may be added over time. DocValues and Points are not completely supported, but numeric lookup and range querying are possible with FDBNumericPoint, and sorting by number is possible with the standard NumericDocValuesField class.

FDBDirectory

This class is a full implementation of Lucene's Directory abstraction.

Lucene expects to write to disk (via a file system) and uses an inverted index for this reason. To balance the optimal on-disk format with the need to efficiently update an index, Lucene creates multiple "segments". Each of these segments is an index in its own right, though Lucene makes it easy to search across all segments.

Because Lucene assumes a file system, it defines its own transactional semantics. Firstly, a lock file is used to ensure there is only a single writer to the index at a time. Secondly, data that is written to a file is not required to be visible until the file is closed. Finally, there is a central file (called the segments file) which names the other files in the directory which constitute the index. This allows Lucene to build files in the index without making them immediately visible. The segments file is itself updated atomically.
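The visibility rule described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class CommitPointSketch, its method names and the file name "_0.cfs" are hypothetical, and neither Lucene nor FoundationDB is involved. It shows files being written without becoming part of the index until a single atomic update of the "segments file" names them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory model of "files become visible only via the segments file". */
final class CommitPointSketch {
    // All files ever written, whether or not the index currently refers to them.
    private final Map<String, byte[]> files = new HashMap<>();

    // Stands in for the segments file: the list of file names that constitute the index.
    private final AtomicReference<List<String>> segmentsFile =
            new AtomicReference<>(Collections.<String>emptyList());

    /** Writing a file stores its bytes but does not make it part of the index. */
    void writeFile(String name, byte[] data) {
        files.put(name, data.clone());
    }

    /** Replaces the whole list of visible files in one atomic step. */
    void commit(List<String> names) {
        for (String name : names) {
            if (!files.containsKey(name)) {
                throw new IllegalArgumentException("unknown file: " + name);
            }
        }
        segmentsFile.set(Collections.unmodifiableList(new ArrayList<>(names)));
    }

    /** The files a reader opening the index at this moment would see. */
    List<String> visibleFiles() {
        return segmentsFile.get();
    }

    public static void main(String[] args) {
        CommitPointSketch index = new CommitPointSketch();
        index.writeFile("_0.cfs", new byte[] {1, 2, 3});
        System.out.println("before commit: " + index.visibleFiles());
        index.commit(Collections.singletonList("_0.cfs"));
        System.out.println("after commit: " + index.visibleFiles());
    }
}
```

A reader that calls visibleFiles() sees either the old list or the new one, never a partially updated list, which is the property the segments file provides.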

These design decisions within Lucene tell us where, and whether, to apply FDB transactional semantics. When writing to a new file, for example, we have no need to put a transaction around the data we're writing. FoundationDB, of course, requires one, but it has no semantic meaning to Lucene. We can therefore buffer as much data as we like to form an optimal transaction size. In contrast, the rename method must be atomic, so it is a case where the transaction boundary does matter.
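The buffering idea can be sketched as follows. This is not the FDBDirectory implementation: BatchedWriterSketch, maxBatchBytes and the in-memory list of "committed batches" are all hypothetical stand-ins, with each list entry representing one database transaction. The point is that the size of each write call and the size of each transaction are decoupled.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical writer that groups appended bytes into fixed-size batches. */
final class BatchedWriterSketch {
    private final int maxBatchBytes; // assumed tuning knob, not a value taken from FDBLucene
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    // Each entry stands in for the payload of one committed transaction.
    private final List<byte[]> committedBatches = new ArrayList<>();

    BatchedWriterSketch(int maxBatchBytes) {
        if (maxBatchBytes <= 0) {
            throw new IllegalArgumentException("maxBatchBytes must be positive");
        }
        this.maxBatchBytes = maxBatchBytes;
    }

    /** Appends data, committing a batch each time the buffer fills up. */
    void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            int room = maxBatchBytes - buffer.size();
            int n = Math.min(room, data.length - offset);
            buffer.write(data, offset, n);
            offset += n;
            if (buffer.size() == maxBatchBytes) {
                flush();
            }
        }
    }

    /** Commits whatever is left in the buffer; call once when the file is closed. */
    void close() {
        if (buffer.size() > 0) {
            flush();
        }
    }

    private void flush() {
        committedBatches.add(buffer.toByteArray());
        buffer.reset();
    }

    List<byte[]> committedBatches() {
        return committedBatches;
    }

    public static void main(String[] args) {
        BatchedWriterSketch writer = new BatchedWriterSketch(4);
        writer.write(new byte[] {1, 2, 3});
        writer.write(new byte[] {4, 5, 6});
        writer.close();
        System.out.println("batches committed: " + writer.committedBatches().size());
    }
}
```

Because nothing is required to be visible before the file is closed, where the batch boundaries fall has no effect on what a reader of the index can observe.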

FDBDirectory stores all its data in FoundationDB using a user-specified key prefix, represented as a Subspace. Each file within the index is given a unique number, generated by a per-index counter entry. Binary data within the file is stored as pages. This is essentially the large-values pattern described at https://apple.github.io/foundationdb/largeval.html.
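A minimal sketch of the paging layout follows, using a TreeMap as a stand-in for FoundationDB's ordered key space. Everything here is an assumption made for illustration: the class PagedFileSketch, the tiny PAGE_SIZE of 4 bytes and the packing of the key into a long are not taken from FDBDirectory, whose actual key format may differ.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of files stored as numbered pages in an ordered key space. */
final class PagedFileSketch {
    // Deliberately tiny so the example is easy to follow; not the real page size.
    static final int PAGE_SIZE = 4;

    // Ordered map standing in for the key-value store.
    private final TreeMap<Long, byte[]> store = new TreeMap<>();

    // Stands in for the per-index counter entry that hands out file numbers.
    private long nextFileNumber = 0;

    long createFile() {
        return nextFileNumber++;
    }

    // Packs (fileNumber, pageIndex) so that all pages of one file are adjacent and in order.
    private static long key(long fileNumber, int pageIndex) {
        return (fileNumber << 32) | (pageIndex & 0xFFFFFFFFL);
    }

    // View of every page belonging to one file, in page order.
    private Map<Long, byte[]> pagesOf(long fileNumber) {
        return store.subMap(key(fileNumber, 0), true, key(fileNumber + 1, 0), false);
    }

    /** Splits data into pages of at most PAGE_SIZE bytes and stores each under its own key. */
    void write(long fileNumber, byte[] data) {
        int page = 0;
        for (int offset = 0; offset < data.length; offset += PAGE_SIZE) {
            int end = Math.min(offset + PAGE_SIZE, data.length);
            store.put(key(fileNumber, page++), Arrays.copyOfRange(data, offset, end));
        }
    }

    /** Reassembles a file by scanning its pages in key order. */
    byte[] read(long fileNumber) {
        int length = 0;
        for (byte[] page : pagesOf(fileNumber).values()) {
            length += page.length;
        }
        byte[] out = new byte[length];
        int pos = 0;
        for (byte[] page : pagesOf(fileNumber).values()) {
            System.arraycopy(page, 0, out, pos, page.length);
            pos += page.length;
        }
        return out;
    }

    int pageCount(long fileNumber) {
        return pagesOf(fileNumber).size();
    }

    public static void main(String[] args) {
        PagedFileSketch index = new PagedFileSketch();
        long file = index.createFile();
        index.write(file, new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9});
        System.out.println("pages: " + index.pageCount(file));
        System.out.println("bytes read back: " + index.read(file).length);
    }
}
```

Keeping the page index in the low-order part of the key means a whole file can be read with one range scan, and a single page can be fetched with one point lookup.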

Lucene creates empty files, fills them with data by appending, and then closes them. The files are never updated again. They are therefore highly cacheable. FDBLucene exploits this property by caching every page that it reads from any file. The behaviour and capacity of that cache are configurable by the user, as FDBLucene uses Apache JCS (http://commons.apache.org/jcs/). To avoid cache coherency issues if an index is deleted and recreated, the cache for an individual file is valid only until the enclosing Directory is closed.
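The caching behaviour can be sketched with a small least-recently-used cache. FDBLucene itself uses Apache JCS; the version below swaps in a plain LinkedHashMap so that it runs with the JDK alone. The class PageCacheSketch, the loader function and the string key format are hypothetical. Because pages never change once written, the cache needs no invalidation logic other than being cleared on close.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical LRU cache of immutable pages, keyed by file number and page index. */
final class PageCacheSketch {
    // Called on a cache miss; stands in for reading the page from the database.
    private final BiFunction<Long, Integer, byte[]> loader;
    private final Map<String, byte[]> pages;

    PageCacheSketch(final int maxPages, BiFunction<Long, Integer, byte[]> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.pages = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxPages;
            }
        };
    }

    /** Returns the cached page, loading and caching it first if necessary. */
    byte[] get(long fileNumber, int pageIndex) {
        String key = fileNumber + "/" + pageIndex;
        byte[] page = pages.get(key);
        if (page == null) {
            page = loader.apply(fileNumber, pageIndex);
            pages.put(key, page);
        }
        return page;
    }

    /** Models closing the enclosing Directory: every cached page is discarded. */
    void close() {
        pages.clear();
    }

    int cachedPages() {
        return pages.size();
    }

    public static void main(String[] args) {
        PageCacheSketch cache =
                new PageCacheSketch(2, (file, page) -> new byte[] {(byte) (file * 10 + page)});
        cache.get(0, 0);
        cache.get(0, 1);
        cache.get(0, 2); // evicts the least recently used page
        System.out.println("cached pages: " + cache.cachedPages());
    }
}
```

Clearing the cache on close mirrors the rule that cached pages are trusted only for the lifetime of one open Directory, so a deleted and recreated index can never be served stale pages.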