/sqlite-rembed

A SQLite extension for generating text embeddings from remote APIs (OpenAI, Nomic, Ollama, llamafile...)

Primary LanguageRustApache License 2.0Apache-2.0

sqlite-rembed

A SQLite extension for generating text embeddings from remote APIs (OpenAI, Nomic, Cohere, llamafile, Ollama, etc.). A sister project to sqlite-vec and sqlite-lembed. A work-in-progress!

Usage

.load ./rembed0

INSERT INTO temp.rembed_clients(name, options)
 VALUES ('text-embedding-3-small', 'openai');

select rembed(
  'text-embedding-3-small',
  'The United States Postal Service is an independent agency...'
);

The temp.rembed_clients virtual table lets you "register" clients with pure INSERT INTO statements. The name field is a unique identifier for a given client, and options allows you to specify which 3rd party embedding service you want to use.

In this case, openai is a pre-defined client that will default to OpenAI's https://api.openai.com/v1/embeddings endpoint and will source your API key from the OPENAI_API_KEY environment variable. The name of the client, text-embedding-3-small, will be used as the embeddings model.

Other pre-defined clients include:

Client name Provider Endpoint API Key
openai OpenAI https://api.openai.com/v1/embeddings OPENAI_API_KEY
nomic Nomic https://api-atlas.nomic.ai/v1/embedding/text NOMIC_API_KEY
cohere Cohere https://api.cohere.com/v1/embed CO_API_KEY
jina Jina https://api.jina.ai/v1/embeddings JINA_API_KEY
mixedbread MixedBread https://api.mixedbread.ai/v1/embeddings/ MIXEDBREAD_API_KEY
llamafile llamafile http://localhost:8080/embedding None
ollama Ollama http://localhost:11434/api/embeddings None

Different client options can be specified with remebed_client_options(). For example, if you have a different OpenAI-compatible service you want to use, then you can use:

INSERT INTO temp.rembed_clients(name, options) VALUES
  (
    'xyz-small-1',
    rembed_client_options(
      'format', 'openai',
      'url', 'https://api.xyz.com/v1/embeddings',
      'key', 'xyz-ca865ece65-hunter2'
    )
  );

Or to use a llamafile server that's on a different port:

INSERT INTO temp.rembed_clients(name, options) VALUES
  (
    'xyz-small-1',
    rembed_client_options(
      'format', 'lamafile',
      'url', 'http://localhost:9999/embedding'
    )
  );

Using with sqlite-vec

sqlite-rembed works well with sqlite-vec, a SQLite extension for vector search. Embeddings generated with rembed() use the same BLOB format for vectors that sqlite-vec uses.

Here's a sample "semantic search" application, made from a sample dataset of news article headlines.

create table articles(
  headline text
);

-- Random NPR headlines from 2024-06-04
insert into articles VALUES
  ('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),
  ('The jury has been selected in Hunter Biden''s gun trial'),
  ('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),
  ('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),
  ('An Epoch Times executive is facing money laundering charge');


-- Build a vector table with embeddings of article headlines, using OpenAI's API
create virtual table vec_articles using vec0(
  headline_embeddings float[1536]
);

insert into vec_articles(rowid, headline_embeddings)
  select rowid, rembed('text-embedding-3-small', headline)
  from articles;

Now we have a regular articles table that stores text headlines, and a vec_articles virtual table that stores embeddings of the article headlines, using OpenAI's text-embedding-3-small model.

To perform a "semantic search" on the embeddings, we can query the vec_articles table with an embedding of our query, and join the results back to our articles table to retrieve the original headlines.

param set :query 'firearm courtroom'

with matches as (
  select
    rowid,
    distance
  from vec_articles
  where headline_embeddings match rembed('text-embedding-3-small', :query)
  order by distance
  limit 3
)
select
  headline,
  distance
from matches
left join articles on articles.rowid = matches.rowid;

/*
+--------------------------------------------------------------+------------------+
|                           headline                           |     distance     |
+--------------------------------------------------------------+------------------+
| The jury has been selected in Hunter Biden's gun trial       | 1.05906391143799 |
+--------------------------------------------------------------+------------------+
| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 1.2574303150177  |
| ted to gambling and theft                                    |                  |
+--------------------------------------------------------------+------------------+
| An Epoch Times executive is facing money laundering charge   | 1.27144026756287 |
+--------------------------------------------------------------+------------------+
*/

Notice how "firearm courtroom" doesn't appear in any of these headlines, but it can still figure out that "Hunter Biden's gun trial" is related, and the other two justice-related articles appear on top.

Drawbacks

  1. No batch support yet. If you use rembed() in a batch UPDATE or INSERT in 1,000 rows, then 1,000 HTTP requests will be made. Add a 👍 to Issue #1 if you want to see this fixed.
  2. No builtin rate limiting. Requests are sent sequentially so this may not come up in small demos, but sqlite-rembed could add features that handles rate limiting/retries implicitly. Add a 👍 to Issue #2 if you want to see this implemented.