/IndexerMcIndexFace

Primary LanguageRustGNU General Public License v3.0GPL-3.0

A (toy) low-level document indexing and retrieval system

IndexerMcIndexFace is a tiny traditional document indexing and retrieval system that I wrote as an excuse to play with FSTs (using the BurntSushi/fst crate) and Rust's parallelization capabilities (using also the crossbeam crate for message passing)

Features:

  • Fully written in Rust
  • Uses FSTs for fast access to postings
  • Allows fielded documents, and uses the BM25F retrieval model (note: I didn't verify its correctness)
  • The indexing stage is paralellized with a threadpool by creating and merging independent indexes
    • (Note that this is a naive implementation, and although it's extremely fast it can be really memory hungry)
  • The retrieval stage is parallelized with a threadpool, where in this case it runs a different search for every token

Warnings:

  • This is a toy project (e.g: index files are not compressed, the parallelization techniques are naive and resource-hungry...) and the API is very basic.

Usage:

  • Simply run cargo run --release. main.rs will create a dummy collection of 1000 files using the MitchellRhysHall/random_word crate, and then will index and perform a randomised moderately sized query.

Possible improvements:

  • The use of FSTs opens up many possibilities, as regex-like searches can be easily performed.
  • Better parallelization techniques: Right now, each thread will create its own in-memory index, which will be later joined and written to binary files. This means that the memory usage can be very high for bigger collections of documents.
  • Better tokenizers.
  • N-gram or similar, more elaborate, indexes.
  • Alternative retrieval models, phrase queries, etc.