sqlite-lembed

A SQLite extension for generating text embeddings with llama.cpp. A sister project to sqlite-vec and sqlite-rembed. A work-in-progress!

Usage

sqlite-lembed generates embeddings from embedding models in the GGUF format. These can be a bit hard to find or convert, so here's a sample model you can use:

curl -L -o all-MiniLM-L6-v2.e4ce9877.q8_0.gguf https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/all-MiniLM-L6-v2/all-MiniLM-L6-v2.e4ce9877.q8_0.gguf

This is the sentence-transformers/all-MiniLM-L6-v2 model that I converted to the .gguf format and quantized at Q8_0 (made smaller at the expense of some quality).

To load it into sqlite-lembed, register it with the temp.lembed_models table.

.load ./lembed0

INSERT INTO temp.lembed_models(name, model)
  select 'all-MiniLM-L6-v2', lembed_model_from_file('all-MiniLM-L6-v2.e4ce9877.q8_0.gguf');

select lembed(
  'all-MiniLM-L6-v2',
  'The United States Postal Service is an independent agency...'
);

The temp.lembed_models virtual table lets you "register" models with plain INSERT INTO statements. The name field is a unique identifier for a given model, and model is typically provided as a path to the .gguf model on disk, via the lembed_model_from_file() function.
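
Once registered, the model is stored as a row in temp.lembed_models, so a plain SELECT over the name column should list everything you've loaded (just a quick sanity check):

-- list all registered models
select name from temp.lembed_models;
-- all-MiniLM-L6-v2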

Using with sqlite-vec

sqlite-lembed works well with sqlite-vec, a SQLite extension for vector search. Embeddings generated with lembed() use the same BLOB format for vectors that sqlite-vec uses.
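
For example, assuming sqlite-vec is also loaded, you can confirm an embedding's dimensions with sqlite-vec's vec_length() function; an all-MiniLM-L6-v2 embedding should report 384:

-- number of dimensions in the returned embedding BLOB
select vec_length(lembed('all-MiniLM-L6-v2', 'hello world'));
-- 384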

Here's a sample "semantic search" application, made from a sample dataset of news article headlines.

create table articles(
  headline text
);

-- Random NPR headlines from 2024-06-04
insert into articles VALUES
  ('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),
  ('The jury has been selected in Hunter Biden''s gun trial'),
  ('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),
  ('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),
  ('An Epoch Times executive is facing money laundering charge');


-- Build a vector table with embeddings of article headlines
create virtual table vec_articles using vec0(
  headline_embeddings float[384]
);

insert into vec_articles(rowid, headline_embeddings)
  select rowid, lembed('all-MiniLM-L6-v2', headline)
  from articles;

Now we have a regular articles table that stores text headlines, and a vec_articles virtual table that stores embeddings of the article headlines, using the all-MiniLM-L6-v2 model.

To perform a "semantic search" on the embeddings, we can query the vec_articles table with an embedding of our query, and join the results back to our articles table to retrieve the original headlines.

.param set :query 'firearm courtroom'

with matches as (
  select
    rowid,
    distance
  from vec_articles
  where headline_embeddings match lembed('all-MiniLM-L6-v2', :query)
  order by distance
  limit 3
)
select
  headline,
  distance
from matches
left join articles on articles.rowid = matches.rowid;

/*
+--------------------------------------------------------------+------------------+
|                           headline                           |     distance     |
+--------------------------------------------------------------+------------------+
| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 1.14812409877777 |
| ted to gambling and theft                                    |                  |
+--------------------------------------------------------------+------------------+
| The jury has been selected in Hunter Biden's gun trial       | 1.18380105495453 |
+--------------------------------------------------------------+------------------+
| An Epoch Times executive is facing money laundering charge   | 1.27715671062469 |
+--------------------------------------------------------------+------------------+
*/

Notice how "firearm courtroom" doesn't appear in any of these headlines, yet the search still figures out that "Hunter Biden's gun trial" is related, and the other two justice-related articles rank near the top.

Embedding Models in .gguf format

Most embedding models out there are provided as PyTorch/ONNX models, but sqlite-lembed requires models in the GGUF file format. Since ggml/GGUF is relatively new, GGUF embedding models can be hard to find. You can always convert models yourself, or use one of these pre-converted embedding models already in GGUF format:

Model Name             Link
nomic-embed-text-v1.5  https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF
mxbai-embed-large-v1   https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
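
Any of these can be registered with sqlite-lembed the same way as the example above. A minimal sketch, assuming you've downloaded one of the nomic-embed-text-v1.5 GGUF files (the exact filename depends on which quantization you pick):

-- 'nomic-embed-text-v1.5.Q8_0.gguf' is an illustrative filename; use the file you downloaded
INSERT INTO temp.lembed_models(name, model)
  select 'nomic-embed-text-v1.5', lembed_model_from_file('nomic-embed-text-v1.5.Q8_0.gguf');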

Drawbacks

  1. No batch support yet. llama.cpp has support for batch processing multiple inputs, but I haven't figured that out yet. Add a 👍 to Issue #2 if you want to see this fixed.
  2. Pre-compiled versions of sqlite-lembed don't use the GPU. This was done to make compilation/distribution easier, but it means generating embeddings will likely take a long time. If you need it to go faster, try compiling sqlite-lembed yourself (docs coming soon).