transitive-bullshit/OpenOpenAI

Add support for different knowledge retrieval methods

transitive-bullshit opened this issue

This is for the built-in retrieval tool.

The current knowledge retrieval implementation is very naive: it simply returns the full contents of every attached file (source).

It also only supports text file types like text/plain and markdown, since no preprocessing or conversion is done at the moment.
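
For concreteness, the current behavior amounts to something like the sketch below (the names and file access here are illustrative only, not the actual implementation):

```ts
// Illustrative sketch of the current naive behavior: no chunking, no
// embeddings, no ranking; just return the full text of every attached file.
import { readFile } from 'node:fs/promises'

async function naiveRetrieval(attachedFilePaths: string[]): Promise<string> {
  const contents = await Promise.all(
    attachedFilePaths.map((filePath) => readFile(filePath, 'utf8'))
  )
  return contents.join('\n\n')
}
```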

It shouldn't be too hard to add support for more legit knowledge retrieval approaches, which would require:

  • processForFileAssistant - File ingestion pre-processing for files marked with purpose: 'assistants' (see the first sketch after this list)

    • converting non-text files to a common format like markdown (this is probably the hardest step to do well across all of the most common file types)
    • chunking files
    • embedding chunks
    • storing embeddings in an external vector store; make sure to store the file_id each chunk comes from for filtering purposes
  • retrievalTool - Performs knowledge retrieval for a given query over a set of file_ids for RAG (see the second sketch after this list)

    • embed query
    • semantic search over vector store filtering by the given file_ids
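
Here's a rough TypeScript sketch of what processForFileAssistant could look like. The convertToMarkdown, chunkText, embedTexts, and vectorStore helpers are assumptions for illustration, not existing code in this repo:

```ts
// Assumed helper signatures; none of these exist in the repo today.
declare function convertToMarkdown(content: Buffer, mimeType: string): Promise<string>
declare function chunkText(text: string, opts: { maxTokens: number; overlap: number }): string[]
declare function embedTexts(texts: string[]): Promise<number[][]>
declare const vectorStore: {
  upsert(
    docs: Array<{ id: string; text: string; embedding: number[]; metadata: { file_id: string } }>
  ): Promise<void>
}

// Hypothetical ingestion pipeline for files uploaded with purpose: 'assistants'.
export async function processForFileAssistant(file: {
  id: string
  content: Buffer
  mimeType: string
}): Promise<void> {
  // 1. Convert non-text file types (pdf, docx, html, ...) to a common
  //    markdown representation.
  const markdown = await convertToMarkdown(file.content, file.mimeType)

  // 2. Chunk the markdown into overlapping segments.
  const chunks = chunkText(markdown, { maxTokens: 512, overlap: 64 })

  // 3. Embed each chunk.
  const embeddings = await embedTexts(chunks)

  // 4. Store embeddings in the external vector store, keeping the source
  //    file_id in metadata so retrieval can filter by attached files.
  await vectorStore.upsert(
    chunks.map((text, i) => ({
      id: `${file.id}:${i}`,
      text,
      embedding: embeddings[i],
      metadata: { file_id: file.id }
    }))
  )
}
```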
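
And a matching sketch of retrievalTool, again using assumed helpers (the filter shape below is Pinecone-style metadata filtering; the exact query API depends on whichever vector store we pick):

```ts
// Assumed helpers, shared with the ingestion sketch above.
declare function embedTexts(texts: string[]): Promise<number[][]>
declare const vectorStore: {
  query(opts: {
    embedding: number[]
    topK: number
    filter: { file_id: { $in: string[] } }
  }): Promise<Array<{ text: string; metadata: { file_id: string } }>>
}

// Hypothetical retrieval step: embed the query, then run a semantic search
// over the vector store restricted to chunks from the given file_ids.
export async function retrievalTool(
  query: string,
  fileIds: string[]
): Promise<Array<{ text: string; file_id: string }>> {
  const [queryEmbedding] = await embedTexts([query])

  const matches = await vectorStore.query({
    embedding: queryEmbedding,
    topK: 8,
    filter: { file_id: { $in: fileIds } }
  })

  return matches.map((m) => ({ text: m.text, file_id: m.metadata.file_id }))
}
```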

Integrations here with LangChain and/or LlamaIndex would be great for their flexibility, but we could also KISS and roll our own using https://github.com/dexaai/dexter