quarkiverse/quarkus-langchain4j

Way to write out embeddings to a file?

edeandrea opened this issue · 12 comments

Discussed in #643

Originally posted by edeandrea May 30, 2024
I'm using the easy-rag extension, which loads a PDF file at startup (in dev mode), runs it through an embedding model, and stores the embeddings in Redis.

Is there a way I could write the embeddings out to a file rather than re-computing them? I'm using Ollama with Mistral, so my dev mode startup takes about 10-15 minutes, and then another 10-15 minutes each time I make a change while in dev mode.

If the embeddings were written out to a file the first time, then each subsequent startup would only need to load them into the store. That would be much faster, especially since (in my case) the file isn't changing.

This capability actually already exists upstream, at least when using the in-memory embedding store. See https://github.com/langchain4j/langchain4j/blob/eba2b1cd500f7b8874d4da5e406f09935febb9b3/langchain4j/src/main/java/dev/langchain4j/store/embedding/inmemory/InMemoryEmbeddingStore.java#L132-L164
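For illustration, using those upstream methods could look roughly like this (a sketch only; `loadOrIngest` is a hypothetical helper, but `fromFile`/`serializeToFile` are the methods in the linked source):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

// Sketch only: reload the store from disk when the file exists,
// otherwise ingest once and serialize it for the next restart.
InMemoryEmbeddingStore<TextSegment> loadOrIngest(Path file) {
    if (Files.exists(file)) {
        return InMemoryEmbeddingStore.fromFile(file); // fast restart path
    }
    InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
    // ... ingest documents through the embedding model here ...
    store.serializeToFile(file); // pay the embedding cost once
    return store;
}
```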

I think all we'd need here in the short term is some logic in the easy-rag extension that makes use of these methods.

What I'm thinking is that, for now, we only support it for in-memory embedding stores. We'd introduce these properties:

# Whether or not to turn on this feature
# Default is false, preserving existing functionality
quarkus.langchain4j.easy-rag.reuse-embeddings.enabled=false|true

# Path to where to store/read the embeddings file
# Must be set to something if quarkus.langchain4j.easy-rag.reuse-embeddings.enabled==true
quarkus.langchain4j.easy-rag.reuse-embeddings.file=<some file path>

If quarkus.langchain4j.easy-rag.reuse-embeddings.enabled==false then there won't be any change to existing functionality.

If quarkus.langchain4j.easy-rag.reuse-embeddings.enabled==true, then on startup quarkus.langchain4j.easy-rag.reuse-embeddings.file will be examined. If it exists, then the embedding store will be loaded from it (rather than querying the EmbeddingModel). If it doesn't exist, then the embeddings will be loaded as they are today, and then they will be written out to that file, so that during a restart the EmbeddingModel doesn't need to be re-queried.
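Abstracting away the embedding specifics, the startup behavior described above boils down to a load-or-compute-then-persist pattern, roughly like this (hypothetical helper names, not actual extension code; the "embeddings" are represented by a plain string for brevity):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Hypothetical sketch of the proposed startup logic: if the embeddings
// file exists, load it; otherwise run the expensive computation (querying
// the EmbeddingModel) and write the result out so the next dev mode
// restart can take the fast path.
public class ReuseEmbeddings {

    static String loadOrCompute(Path embeddingsFile, Supplier<String> computeEmbeddings) throws Exception {
        if (Files.exists(embeddingsFile)) {
            // Fast path: reuse previously computed embeddings
            return Files.readString(embeddingsFile);
        }
        // Slow path: compute once, then persist for subsequent restarts
        String serialized = computeEmbeddings.get();
        Files.writeString(embeddingsFile, serialized);
        return serialized;
    }
}
```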

@jmartisk / @geoand thoughts on this approach? If everyone is in agreement then I can take up implementing it.

Just to make sure we're on the same page, is this how it will work?

  • if ingestion-strategy=OFF, then ingestion is skipped altogether
  • if ingestion-strategy=ON
    • if reuse-embeddings=true
      • if embedding file exists
        • ingest from the file
      • else
        • if the target embedding store is an instance of InMemoryEmbeddingStore
          • ingest using the embedding model into the InMemoryEmbeddingStore and dump it into the file
        • else (if there is a persistent store configured)
          • ingest using the embedding model into an InMemoryEmbeddingStore first, then dump it into the file, and re-ingest into the target store
    • else
      • ingesting using the embedding model

I think a flow diagram or a similar pseudocode snippet would be nice to have in the documentation, because it's getting a little complex.

It's actually a little simpler than that. There are no logic changes whatsoever unless an InMemoryEmbeddingStore is in use, since only InMemoryEmbeddingStore has methods for reading its data from and writing it to the filesystem.

I'd be sure to add something prettier to the documentation once completed :)

if (ingestion-strategy == OFF) {
  ingestion skipped altogether (same as today)
}
else {
  if ((reuse-embeddings == true) && (embeddingStore instanceof InMemoryEmbeddingStore)) {
    if (ingestion-file.exists()) {
      ingest from file
    }
    else {
      ingest from embedding model and write out to file (next startup will ingest from file)
    }
  }
  else {
    ingest using the embedding model (what we do today)
  }
}

Also, if reuse-embeddings == true and the user does not configure a file to ingest from, then that is an error condition which fails the startup.

  ingest from embedding model and write out to file (next startup will ingest from file)

but what if the target store is a persistent store? Then you need to ingest into a temporary in-memory store first - to be able to dump the contents into a file. And only then you can copy it into the final, persistent store. Unless you graft in some logic that generates the file on-the-fly, without actually using an InMemoryEmbeddingStore, and uses the same format.

Hm, that brings up the idea that maybe there should be a way to dump embeddings to a file as a stream, without having to build a full InMemoryEmbeddingStore, because the data may be too large to fit in memory all at once. And likewise for reading them back.

but what if the target store is a persistent store? Then you need to ingest into a temporary in-memory store first - to be able to dump the contents into a file. And only then you can copy it into the final, persistent store. Unless you graft in some logic that generates the file on-the-fly, without actually using an InMemoryEmbeddingStore, and uses the same format.

For now, this would only be an enhancement if using an in-memory embedding store. If it is configured to use some other kind of persistent store, then nothing would change from what we're doing today.

The use case I'm really trying to solve here is with dev mode. Even with a small set of embeddings, generating & persisting the embeddings in an in-memory store can take several minutes, which is a major pain in dev mode. Each time I make a code change I need to wait sometimes 10 minutes for the app to restart, even for a small set of embeddings.

This would mean that in dev mode you only pay that price once, and the result gets reused. Although this would work outside of dev mode too, the use case stems from dev mode.

Oh right, I thought it would make sense with persistent stores too in dev mode, but now I realize that the default behavior (for most dev services anyway) is to keep the same instance running across multiple restarts, so it doesn't need re-ingesting for every restart. Go for it then!

Here's a quick flowchart I drew that I'll include in the docs

[flowchart image]

Make sure to add that in the docs :)

Here's a cleaner version that I'll add to the doc

[flowchart image]

👿