Explore options for improving RAG
The current RAG implementation isn't great. It is primarily keyword based and isn't agentic; however, it does work with any LLM model.
The implementation could be significantly improved by:
- Allowing the LLM to generate search queries across user databases
- Having a proper vector database implementation
Explore using existing RAG libraries.
After a lot of experimentation with different RAG libraries, I decided to build an agent using langchain
and build a suite of tools the agent can use to answer prompts.
For now, I'm using the legacy langchain agent framework, not the new LangGraph framework, as the legacy one appeared better supported and easier to use.
Sidenote: I've found langchain to be quite buggy. The documentation often doesn't match the actual code implementation. Many of the community libraries have issues.
Implemented tools
- Query tools for all user databases (emails, calendar, files etc.)
- Vector database tool (using TensorFlow for local embedding generation)
- Chat thread query tool
This can easily be expanded in the future to add support for any other langchain built-in tools (e.g. web search, Wikipedia, web page extraction).
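For reference, a minimal sketch of how an agent like this is wired up using LangChain JS's tool-calling agent. The tool body, Bedrock model ID and region are illustrative assumptions, not our exact implementation:

```ts
import { ChatBedrockConverse } from "@langchain/aws";
import { DynamicStructuredTool } from "@langchain/core/tools";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { AgentExecutor, createToolCallingAgent } from "langchain/agents";
import { z } from "zod";

// Illustrative query tool over a user database (the body is a stand-in
// for the real datastore query)
const emailQueryTool = new DynamicStructuredTool({
  name: "query_emails",
  description: "Search the user's email database by keyword",
  schema: z.object({ keywords: z.string() }),
  func: async ({ keywords }) => {
    // ...run a find() against the email datastore using `keywords`...
    return JSON.stringify([]);
  },
});

const tools = [emailQueryTool];

// Claude 3.5 Haiku via AWS Bedrock (model ID / region are assumptions)
const llm = new ChatBedrockConverse({
  model: "anthropic.claude-3-5-haiku-20241022-v1:0",
  region: "us-east-1",
});

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "Answer questions using the user's personal data tools."],
  ["human", "{input}"],
  ["placeholder", "{agent_scratchpad}"],
]);

const agent = createToolCallingAgent({ llm, tools, prompt });
const executor = new AgentExecutor({ agent, tools });

const result = await executor.invoke({
  input: "How many emails have I received from Ryan?",
});
console.log(result.output);
```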
Results
Results so far are much better, as the LLM can easily search user data and run semantic searches against the vector database to access the most relevant information for answering a prompt.
It can now answer things like: "What upcoming meetings do I have?", "How many emails have I received from Ryan?", "Review my sent emails to create a personality profile", "How often do I send things?".
The agent supports using multiple tools in sequence, so it may request some data from the database and then do a vector database search to get more results (or vice versa). The agent is in complete control of this process, calling the available tools until it has enough information to answer the prompt (the sketch below shows how to surface those intermediate steps).
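On that note, LangChain's `AgentExecutor` can return the intermediate tool calls, which makes the agent's sequencing visible. A sketch reusing the `agent` and `tools` from the earlier example:

```ts
// Rebuild the executor so it reports each tool invocation it made
const executor = new AgentExecutor({
  agent,
  tools,
  returnIntermediateSteps: true, // include (action, observation) pairs in the result
});

const result = await executor.invoke({
  input: "What upcoming meetings do I have?",
});

// Each step records which tool was called, with what input, and what came back
for (const step of result.intermediateSteps) {
  console.log(step.action.tool, step.action.toolInput, step.observation);
}
```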
LLM Model Notes
Using a langchain agent with tools requires an LLM interface that supports tool calling. Currently this limits us to Anthropic models: we are using AWS Bedrock, and that's the only model family there that supports using tools. I have chosen Claude 3.5 Haiku for now. I would prefer to be using Llama, as it's open source and self-hosted via RedPill.
I have experimented with RedPill.ai (https://red-pill.ai/), which is fast and runs in TEEs; however, their API interface doesn't appear to be compatible with langchain. I have asked them for more info on this.
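If their API turns out to be OpenAI-compatible, one avenue to test is pointing LangChain's `ChatOpenAI` at a custom base URL. The endpoint, key, and model ID below are assumptions read from env vars, not confirmed RedPill details:

```ts
import { ChatOpenAI } from "@langchain/openai";

// Assumes RedPill exposes an OpenAI-compatible endpoint (unconfirmed);
// URL, key, and model name are supplied via env vars rather than guessed
const llm = new ChatOpenAI({
  model: process.env.REDPILL_MODEL ?? "llama-3.1-70b-instruct", // hypothetical ID
  apiKey: process.env.REDPILL_API_KEY,
  configuration: {
    baseURL: process.env.REDPILL_API_BASE_URL,
  },
});
```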
Embeddings
I am using TensorFlow for arm64, run locally, to generate embeddings. This is quite a slow process on my local Mac. Once we have a secure connection to RedPill via langchain, this could potentially be replaced with a remote service, which should also provide faster performance.
The embeddings are currently stored using the CloseVector library and are cached on disk for increased performance.
It is worth considering encrypting these generated vector database indexes and storing them somewhere for fast reloading.
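A sketch of the current shape of this setup, using LangChain's TensorFlow embeddings with a CloseVector store cached on disk (the cache path and document shape are illustrative):

```ts
import "@tensorflow/tfjs-node"; // native TF bindings (arm64 build on Apple Silicon)
import { TensorFlowEmbeddings } from "@langchain/community/embeddings/tensorflow";
import { CloseVectorNode } from "@langchain/community/vectorstores/closevector/node";

const CACHE_PATH = "./vector-cache/emails"; // illustrative on-disk cache location

// Embeddings are generated locally via TensorFlow
const embeddings = new TensorFlowEmbeddings();

// Reload the index from the disk cache if present, otherwise build and cache it
async function loadOrBuildIndex(
  texts: string[],
  metadatas: Record<string, unknown>[]
) {
  try {
    return await CloseVectorNode.load(CACHE_PATH, embeddings);
  } catch {
    const store = await CloseVectorNode.fromTexts(texts, metadatas, embeddings);
    await store.save(CACHE_PATH); // cache on disk for fast reloading
    return store;
  }
}
```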
Remaining tasks:
- Cleanup dependencies
- Check code refactor hasn't broken existing code, indexing and hot loading
- Clear cache process (delete in memory indexes, vector store and on-disk vector store)
- Expose new RAG via new `/agent` endpoint
- Implement get records tool for agent to fetch specific records by ID (a rough sketch follows this list)
- Support metadata searching for vector database (timestamp, type, groupId)
- Cleanup LLM key management
- Provide output with information on what tools the agent is calling, latency, tokens used
- New verida-js protocol release with updated pouchdb find fix (also change signing security model)
- Implement `ChatThread` agent tool
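For the get-records task above, a minimal sketch of what the tool could look like. The `datastore` interface and tool name are hypothetical stand-ins for the real Verida datastore API:

```ts
import { DynamicStructuredTool } from "@langchain/core/tools";
import { z } from "zod";

// Hypothetical datastore interface, standing in for the real Verida API
declare const datastore: {
  get(database: string, id: string): Promise<unknown>;
};

const getRecordsTool = new DynamicStructuredTool({
  name: "get_records",
  description:
    "Fetch specific records by ID from a user database (emails, calendar, files, etc.)",
  schema: z.object({
    database: z.string().describe("The user database to read from"),
    ids: z.array(z.string()).describe("Record IDs to fetch"),
  }),
  func: async ({ database, ids }) => {
    const records = await Promise.all(
      ids.map((id) => datastore.get(database, id))
    );
    return JSON.stringify(records);
  },
});
```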
Known issues:
- Token totals in the agent response don't include the final LLM request (langchain doesn't seem to support this out of the box; a possible workaround is sketched below)
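One possible workaround is to aggregate usage with a callback across every LLM call the executor makes. This assumes the underlying provider reports `tokenUsage` in `llmOutput`, which varies by model integration:

```ts
// Accumulate token usage across all LLM calls in a single agent run
const totals = { prompt: 0, completion: 0 };

const result = await executor.invoke(
  { input: "What upcoming meetings do I have?" },
  {
    callbacks: [
      {
        handleLLMEnd(output) {
          // Shape varies by provider; tokenUsage may be undefined for some models
          const usage = output.llmOutput?.tokenUsage;
          totals.prompt += usage?.promptTokens ?? 0;
          totals.completion += usage?.completionTokens ?? 0;
        },
      },
    ],
  }
);

console.log(result.output, totals);
```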
Future enhancements:
LLM update
I have been able to switch from AWS Bedrock to Together.ai and get that working with Llama 3.1 in langchain.
- `Llama-3.1-70B-Instruct-Turbo` produces much worse results than Claude 3.5 Haiku.
- `Llama-3.2-3B-Instruct-Turbo` doesn't support JSON / function calling, so it can't be used as an agent with tools.

Note: Many models don't support JSON / function calling.
This leaves a conundrum: the open source Llama models don't work as well as the proprietary models (ChatGPT, Claude). Perhaps we give the user a choice (a rough sketch follows the list)?
- Claude (better results, but data goes via AWS Bedrock and is subject to their privacy policy)
- Llama (not as good results, but 100% private, running in a TEE via RedPill)
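A sketch of a model-selection factory under that assumption. The Bedrock and Together model IDs are my best guesses and should be verified:

```ts
import { ChatBedrockConverse } from "@langchain/aws";
import { ChatTogetherAI } from "@langchain/community/chat_models/togetherai";

type PrivacyPreference = "quality" | "private";

// Let the user trade answer quality for privacy (model IDs should be verified)
function getLLM(preference: PrivacyPreference) {
  if (preference === "quality") {
    // Claude via AWS Bedrock: better results, AWS privacy policy applies
    return new ChatBedrockConverse({
      model: "anthropic.claude-3-5-haiku-20241022-v1:0",
    });
  }
  // Llama via Together.ai (or RedPill once langchain-compatible):
  // weaker results, but open source
  return new ChatTogetherAI({
    model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  });
}
```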
Embeddings update
There is a performance bottleneck associated with creating embeddings and storing them in the vector database. I have also been getting out-of-memory errors on my local Mac when running TensorFlow embeddings locally.
I tried generating embeddings via AWS Bedrock; however, the network latency across lots of small documents made that even slower, plus it costs money and sends more data to AWS infra, which isn't ideal.
LLM Update
I am hitting issues if I directly ask Claude for my personal information due to its privacy controls. Asking it to complete a JSON object with that information seems to work okay though, which is a relief.
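A sketch of that prompt framing, reusing the `executor` from earlier (the JSON fields are illustrative):

```ts
// Frame the request as completing a JSON object rather than asking for
// personal information directly; field names are illustrative
const input = `Complete the following JSON object using the available tools.
Return only the completed JSON:
{
  "fullName": "",
  "homeCity": "",
  "frequentContacts": []
}`;

const result = await executor.invoke({ input });
console.log(result.output);
```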