Add support for different knowledge retrieval methods
transitive-bullshit opened this issue · 0 comments
This is for the built-in `retrieval` tool.
The current knowledge retrieval implementation is very naive: it simply returns the full contents of every attached file (source).
It also only supports text file types like `text/plain` and `markdown`, as no preprocessing or conversions are done at the moment.
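
For reference, the naive behavior described above amounts to something like the following. This is only a sketch: `AttachedFile`, `TEXT_TYPES`, and `naiveRetrieval` are hypothetical names, not the project's actual code, and it assumes unsupported file types are simply skipped.

```typescript
// Hypothetical shape of an attached file; not the project's actual type.
interface AttachedFile {
  id: string
  mimeType: string
  content: string
}

// Assumption: only plain text and markdown are usable without conversion.
const TEXT_TYPES = new Set(['text/plain', 'text/markdown'])

// Naive retrieval: the query is ignored and every text file is returned in full.
function naiveRetrieval(_query: string, files: AttachedFile[]): string[] {
  return files
    .filter((file) => TEXT_TYPES.has(file.mimeType))
    .map((file) => file.content)
}
```

Because the query never influences the result, the amount of returned context grows with the total size of the attached files rather than with what is relevant.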
It shouldn't be too hard to add support for more legit knowledge retrieval approaches, which would require:
- `processForFileAssistant`
  - File ingestion pre-processing for files marked with `purpose: 'assistants'`
    - converting non-text files to a common format like `markdown` (this is probably the hardest step to do well across all of the most common file types)
    - chunking files
    - embedding chunks
    - storing embeddings to an external vector store; make sure to store the `file_id` each chunk comes from for filtering purposes
- `retrievalTool`
  - Performs knowledge retrieval for a given `query` on a set of `file_ids` for RAG.
    - embed `query`
    - semantic search over vector store filtering by the given `file_ids`
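
A rough sketch of how the two pieces above could fit together, assuming files have already been converted to markdown. Everything here is hypothetical: `toyEmbed` is a hashed bag-of-words stand-in for a real embedding model, `InMemoryVectorStore` stands in for an external vector store, and `ingestFile` / `retrieve` are illustrative names rather than the project's API.

```typescript
// Hypothetical record kept per chunk; fileId is stored so searches can filter on it.
interface StoredChunk {
  fileId: string
  text: string
  embedding: number[]
}

// Stand-in for a real embedding model: hashed bag-of-words, L2-normalized.
function toyEmbed(text: string, dims = 64): number[] {
  const vec = new Array<number>(dims).fill(0)
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    let hash = 0
    for (let i = 0; i < word.length; i++) {
      hash = (hash * 31 + word.charCodeAt(i)) % dims
    }
    vec[hash] = (vec[hash] ?? 0) + 1
  }
  const norm = Math.hypot(...vec) || 1
  return vec.map((v) => v / norm)
}

// Simplest possible chunking: fixed-size windows of words, no overlap.
function chunkText(text: string, maxWords = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(' '))
  }
  return chunks
}

// Stand-in for an external vector store.
class InMemoryVectorStore {
  private chunks: StoredChunk[] = []

  add(chunk: StoredChunk): void {
    this.chunks.push(chunk)
  }

  // Filter by fileIds first, then rank by dot product (cosine, since vectors are normalized).
  search(queryEmbedding: number[], fileIds: string[], topK: number): StoredChunk[] {
    const allowed = new Set(fileIds)
    const dot = (a: number[], b: number[]) =>
      a.reduce((sum, v, i) => sum + v * (b[i] ?? 0), 0)
    return this.chunks
      .filter((chunk) => allowed.has(chunk.fileId))
      .map((chunk) => ({ chunk, score: dot(queryEmbedding, chunk.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map((result) => result.chunk)
  }
}

// Ingestion side: chunk, embed, and store each chunk with its fileId.
function ingestFile(store: InMemoryVectorStore, fileId: string, markdown: string): void {
  for (const text of chunkText(markdown)) {
    store.add({ fileId, text, embedding: toyEmbed(text) })
  }
}

// Retrieval side: embed the query, then search only within the given fileIds.
function retrieve(
  store: InMemoryVectorStore,
  query: string,
  fileIds: string[],
  topK = 3
): StoredChunk[] {
  return store.search(toyEmbed(query), fileIds, topK)
}
```

Keeping the `fileId` on every stored chunk is what lets the filter run before scoring, so a query only ever sees chunks from the files attached to the current request. A real version would swap `toyEmbed` for an actual embedding model and the in-memory array for a proper vector store.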
Integrations here with LangChain and/or LlamaIndex would be great for their flexibility, but we could also KISS and roll our own using https://github.com/dexaai/dexter