transitive-bullshit/OpenOpenAI

Add support for different knowledge retrieval methods

transitive-bullshit opened this issue

This is for the built-in retrieval tool.

The current knowledge retrieval implementation is very naive: it simply returns the full contents of every attached file (source).

It also only supports text file types like text/plain and markdown, since no preprocessing or conversion is done at the moment.
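
For concreteness, the current behavior amounts to something like the sketch below (the names and file access here are illustrative only, not the actual implementation):

```ts
// Illustrative sketch of the current naive behavior: no chunking, no
// embeddings, no ranking; just return the full text of every attached file.
import { readFile } from 'node:fs/promises'

async function naiveRetrieval(attachedFilePaths: string[]): Promise<string> {
  const contents = await Promise.all(
    attachedFilePaths.map((filePath) => readFile(filePath, 'utf8'))
  )
  return contents.join('\n\n')
}
```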

It shouldn't be too hard to add support for more legit knowledge retrieval approaches, which would require:

  • processForFileAssistant - File ingestion pre-processing for files marked with purpose: 'assistants' (see the first sketch after this list)

    • converting non-text files to a common format like markdown (this is probably the hardest step to do well across all of the most common file types)
    • chunking files
    • embedding chunks
    • storing embeddings in an external vector store; make sure to store the file_id each chunk comes from for filtering purposes
  • retrievalTool - Performs knowledge retrieval for a given query over a set of file_ids for RAG (see the second sketch after this list)

    • embed query
    • semantic search over vector store filtering by the given file_ids
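
Here's a rough TypeScript sketch of what processForFileAssistant could look like. The convertToMarkdown, chunkText, embedTexts, and vectorStore helpers are assumptions for illustration, not existing code in this repo:

```ts
// Assumed helper signatures; none of these exist in the repo today.
declare function convertToMarkdown(content: Buffer, mimeType: string): Promise<string>
declare function chunkText(text: string, opts: { maxTokens: number; overlap: number }): string[]
declare function embedTexts(texts: string[]): Promise<number[][]>
declare const vectorStore: {
  upsert(
    docs: Array<{ id: string; text: string; embedding: number[]; metadata: { file_id: string } }>
  ): Promise<void>
}

// Hypothetical ingestion pipeline for files uploaded with purpose: 'assistants'.
export async function processForFileAssistant(file: {
  id: string
  content: Buffer
  mimeType: string
}): Promise<void> {
  // 1. Convert non-text file types (pdf, docx, html, ...) to a common
  //    markdown representation.
  const markdown = await convertToMarkdown(file.content, file.mimeType)

  // 2. Chunk the markdown into overlapping segments.
  const chunks = chunkText(markdown, { maxTokens: 512, overlap: 64 })

  // 3. Embed each chunk.
  const embeddings = await embedTexts(chunks)

  // 4. Store embeddings in the external vector store, keeping the source
  //    file_id in metadata so retrieval can filter by attached files.
  await vectorStore.upsert(
    chunks.map((text, i) => ({
      id: `${file.id}:${i}`,
      text,
      embedding: embeddings[i],
      metadata: { file_id: file.id }
    }))
  )
}
```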
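
And a matching sketch of retrievalTool, again using assumed helpers (the filter shape below is Pinecone-style metadata filtering; the exact query API depends on whichever vector store we pick):

```ts
// Assumed helpers, shared with the ingestion sketch above.
declare function embedTexts(texts: string[]): Promise<number[][]>
declare const vectorStore: {
  query(opts: {
    embedding: number[]
    topK: number
    filter: { file_id: { $in: string[] } }
  }): Promise<Array<{ text: string; metadata: { file_id: string } }>>
}

// Hypothetical retrieval step: embed the query, then run a semantic search
// over the vector store restricted to chunks from the given file_ids.
export async function retrievalTool(
  query: string,
  fileIds: string[]
): Promise<Array<{ text: string; file_id: string }>> {
  const [queryEmbedding] = await embedTexts([query])

  const matches = await vectorStore.query({
    embedding: queryEmbedding,
    topK: 8,
    filter: { file_id: { $in: fileIds } }
  })

  return matches.map((m) => ({ text: m.text, file_id: m.metadata.file_id }))
}
```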

Integrations here with LangChain and/or LlamaIndex would be great for their flexibility, but we could also KISS and roll our own using https://github.com/dexaai/dexter