/tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Primary LanguageRustMIT LicenseMIT

Build Status codecov Join the chat at https://gitter.im/tantivy-search/tantivy License: MIT Build status Crates.io

Tantivy

Become a patron

Tantivy is a full text search engine library written in Rust.

It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

Benchmark

The following benchmark break downs performance for different type of queries / collection.

Your mileage WILL vary depending on the nature of queries and their load.

Features

  • Full-text search
  • Configurable tokenizer (stemming available for 17 Latin languages with third party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmente) and Korean (lindera + lindera-ko-dic-builder)
  • Fast (check out the 🐎 ✨ benchmark ✨ 🐎)
  • Tiny startup time (<10ms), perfect for command line tools
  • BM25 scoring (the same as Lucene)
  • Natural query language (e.g. (michael AND jackson) OR "king of pop")
  • Phrase queries search (e.g. "michael jackson")
  • Incremental indexing
  • Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
  • Mmap directory
  • SIMD integer compression when the platform/CPU includes the SSE2 instruction set
  • Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
  • &[u8] fast fields
  • Text, i64, u64, f64, dates, and hierarchical facet fields
  • LZ4 compressed document store
  • Range queries
  • Faceted search
  • Configurable indexing (optional term frequency and position indexing)
  • Cheesy logo with a horse

Non-features

  • Distributed search is out of the scope of Tantivy. That being said, Tantivy is a library upon which one could build a distributed search. Serializable/mergeable collector state for instance, are within the scope of Tantivy.

Getting started

Tantivy works on stable Rust (>= 1.27) and supports Linux, MacOS, and Windows.

How can I support this project?

There are many ways to support this project.

  • Use Tantivy and tell us about your experience on Gitter or by email (paul.masurel@gmail.com)
  • Report bugs
  • Write a blog post
  • Help with documentation by asking questions or submitting PRs
  • Contribute code (you can join our Gitter)
  • Talk about Tantivy around you
  • Become a patron

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.

Clone and build locally

Tantivy compiles on stable Rust but requires Rust >= 1.27. To check out and run tests, you can simply run:

    git clone https://github.com/tantivy-search/tantivy.git
    cd tantivy
    cargo build

Run tests

Some tests will not run with just cargo test because of fail-rs. To run the tests exhaustively, run ./run-tests.sh.

Debug

You might find it useful to step through the programme with a debugger.

A failing test

Make sure you haven't run cargo clean after the most recent cargo test or cargo build to guarantee that the target/ directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under rust-gdb:

find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY

Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to cargo test like this:

$gdb run --test-threads 1 --test $NAME_OF_TEST

An example

By default, rustc compiles everything in the examples/ directory in debug mode. This makes it easy for you to make examples to reproduce bugs:

rust-gdb target/debug/examples/$EXAMPLE_NAME
$ gdb run