Hakukone ("search engine" in Finnish) is a simple full-text search backend specifically for the Finnish language.
Given a set of Finnish text documents, Hakukone provides a web service with the following functions:
- Keyword search using Voikko for lemmatization and BM25+ for ranking
- Embedding-based search using cosine distance for ranking
- Adding a batch of new documents
- Marking a document as soft-deleted
- Compacting, i.e. rewriting the database in the background while dropping soft-deleted documents
While this should be regarded as a hobby project that pales in comparison to established full text search systems like Solr/Elasticsearch/Meilisearch/Tantivy, it has a couple of advantages:
- It uses Voikko for lemmatization, whereas other systems seem to use more coarse stemming rules.
- It supports both keyword-based search and embedding-based ("semantic") search. Many alternatives don't provide both.
- It's small and simple, and thus relatively easy to understand, modify, deploy and maintain.
This project aims to do a decent job of full-text search in Finnish. Therefore it may develop additional tuning for the Finnish language e.g. in the way that Voikko is applied.
Otherwise the project is pretty much feature complete. For instance, it does not aim to:
- provide a UI
- combine keyword-based and embedding-based search results
  - A frontend can use e.g. RRF (reciprocal rank fusion) to do that.
- support structured documents (fields, schemas, ...)
- reach state-of-the-art performance for large datasets (see below)
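The RRF combination mentioned above is simple to do client-side. A minimal sketch in Python, with hypothetical doc IDs and the commonly used constant k = 60 (this is generic client code, not part of Hakukone):

```python
# Reciprocal rank fusion: merge several ranked lists of doc IDs into one.
def rrf_merge(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc-a", "doc-b", "doc-c"]    # e.g. from keyword search
embedding_results = ["doc-b", "doc-d", "doc-a"]  # e.g. from embedding search
merged = rrf_merge([keyword_results, embedding_results])
```

Documents that rank well in both lists (doc-b here) rise to the top of the merged list.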
Performance is good enough for small to medium datasets.
With the Finnish Wikipedia dataset (~880k documents, ~100M words as of 2024-05-01), using a Ryzen 5900X with Precision Boost off and 4.2GHz all core max:
- Embedding-based search with 1536-dimensional embeddings takes about 150ms with all 24 threads,
or 1.1 seconds with a single thread.
- Scaling: O(documents_in_database * embedding_length)
- Keyword search takes about 20-30ms per uncommon search word,
but very common words like "ja" can take around 250ms (or 1 second with a single thread).
- Scaling: O(query_words * documents_matched)
Scaling could be improved with indices like HNSW, but for my simple use cases I prefer the simplicity and predictability of the current setup: no need to tune index parameters for speed/accuracy etc. Horizontal scaling is also possible.
You can do an offline data import on the command line as shown here, or you can use the API.
Prepare a file with one JSON object per line. The objects should have the following fields:
- id (string)
- text (string)
- emb (list of numbers) - the embeddings, e.g. from OpenAI.
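As a sketch, a file in this format can be produced with a few lines of Python. The document text and the 4-dimensional embedding are placeholders; a real file would use embeddings whose length matches --embedding-len:

```python
import json

# Write the import file: one JSON object per line with "id", "text"
# and "emb" fields. The content below is purely illustrative.
docs = [
    {"id": "doc-1", "text": "Kissa istuu matolla.", "emb": [0.1, 0.2, 0.3, 0.4]},
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        # ensure_ascii=False keeps Finnish characters readable in the file.
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```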
Pipe the file to the following command to import the data:
    cat data.jsonl | cargo run -- import-jsonl --data-dir ./data --embedding-len 1536
Run the server:

    cargo run -- serve --host 127.0.0.1 --data-dir ./data --embedding-len 1536
Test:
    curl \
      -X POST \
      -H 'Content-Type: application/json' \
      -d '{"query": "kissan karvat", "limit": 10}' \
      http://localhost:3888/api/search/by-text
See API below for further instructions.
There's a Dockerfile:
    docker build -t hakukone:latest .
    docker run -d -p 3888:3888 -v ./data:/app/data hakukone:latest serve --embedding-len 1536
While this project uses very little unsafe Rust, it has some dependencies written in C/C++. Most importantly, text queries are passed to Voikko's tokenizer and analyzer. While Voikko is mature software, input validation and strong sandboxing may still be wise.
Search latency and throughput can be improved with additional cores.
Sharding and replication are possible by running multiple instances.
In any deployment, you should keep all document insertions in a reliable queue in your main application database so they are not lost if a search server is temporarily unavailable.
Ideally your main application should have a way to re-insert all documents to the search server.
Alternatively, you can back up the file data/live/originals.jsonl
while the server is running.
It can then be imported to rebuild the search index.
The file is append-only while no compactions are performed, which may be useful for incremental backups.
All requests are POST with a JSON body.
Does keyword-based search using BM25+ for ranking.
Request fields:

- query: The search keywords separated by whitespace. (No fancy syntax such as boolean operators or quoted phrases is supported.)
- limit: The number of results to return.

Response fields:

- results: A list of objects with the following fields:
  - doc_id (string)
  - score (number) - BM25+ score: higher is better
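For intuition about the scores, here is a minimal sketch of BM25+ scoring for a single query term (Lv & Zhai's lower-bounded variant of BM25). The parameter values k1 = 1.2, b = 0.75 and delta = 1.0 are common defaults, not necessarily what Hakukone uses; the final document score sums this over all query terms:

```python
import math

def bm25_plus(tf, doc_len, avg_doc_len, n_docs, doc_freq,
              k1=1.2, b=0.75, delta=1.0):
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log((n_docs + 1) / doc_freq)
    # Term frequency with document-length normalization.
    length_norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # The "+ delta" term lower-bounds the contribution of any matching
    # term, no matter how long the document is.
    return idf * (tf * (k1 + 1) / length_norm + delta)
```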
Does embedding-based search using cosine distance for ranking.
Request fields:

- embedding: The embedding - a list of numbers.
- limit: The number of results to return.

Response fields:

- results: A list of objects with the following fields:
  - doc_id (string)
  - score (number) - Cosine distance score: lower is better
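The ranking metric is cosine distance, i.e. one minus cosine similarity, so 0 means identical direction and lower is better. A sketch with toy 2-dimensional vectors standing in for real 1536-dimensional embeddings:

```python
import math

def cosine_distance(a, b):
    # 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```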
Indexes new documents to the database.
Idempotence: If any of the provided document IDs are already in the database (even if marked for deletion), they will not be overwritten, but their IDs are returned in the response.
To work around the inability to efficiently overwrite documents, you can design your document IDs to be something like <internal-id>@<version-or-timestamp>.
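The <internal-id>@<version-or-timestamp> convention described above means every update inserts a new document, and the client keeps only the newest version per internal ID (older versions can then be soft-deleted). A sketch with hypothetical IDs and an ISO date as the version:

```python
def latest_versions(doc_ids):
    """Keep only the newest <internal-id>@<version> per internal ID."""
    newest = {}
    for doc_id in doc_ids:
        internal_id, _, version = doc_id.rpartition("@")
        # ISO timestamps compare correctly as plain strings.
        if internal_id not in newest or version > newest[internal_id]:
            newest[internal_id] = version
    return {f"{i}@{v}" for i, v in newest.items()}

ids = ["article-1@2024-01-01", "article-1@2024-05-01", "article-2@2024-03-01"]
```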
This request will fail with a HTTP 409 (Conflict) if there's an ongoing compaction. The frontend should retry later in this case.
Request fields:

- docs: A list of objects with the following fields:
  - id (string)
  - text (string)
  - emb (the embedding - a list of numbers)

Response fields:

- existing_doc_ids (list of strings) - document IDs that already existed in the database and were thus ignored
Marks a document as soft-deleted. It will no longer appear in search results, and it will be removed from the database on compaction.
This request will fail with a HTTP 409 (Conflict) if there's an ongoing compaction. The frontend should retry later in this case.
Idempotence: the request will succeed even if the document does not exist or is already marked as soft-deleted.
Request fields:

- doc_id (string)

Response: Empty object.
Marks as soft-deleted all documents whose document ID starts with the given prefix.
This request will fail with a HTTP 409 (Conflict) if there's an ongoing compaction. The frontend should retry later in this case.
Idempotence: the request will succeed even if some or all of the documents do not exist or are already marked as soft-deleted.
Request fields:

- doc_id_prefix (string)

Response: Empty object.
Rewrites the database in the background and swaps it in for the live database when complete.
Only one compaction can run at a time. This request will fail with a HTTP 409 (Conflict) if there's already an ongoing compaction.
Currently this request blocks until the compaction is complete. This may take a long time if the dataset is large.
Request: Empty object.

Response: Empty object.
Completely clears the database.
This request will fail with a HTTP 409 (Conflict) if there's an ongoing compaction. The frontend should retry later in this case.
Request: Empty object.

Response: Empty object.
- Voikko for stemming/lemmatization
- RocksDB for storing the inverted index (word lemmas -> documents)
- BM25+ for ranking in keyword-based search
- simsimd and Rayon for searching through the embeddings (no index at the moment)
- Actix Web for serving the HTTP API
GPLv3 or later (same as Voikko)