OpenFn/apollo

Embed docs.openfn.org into a vector database

josephjclark opened this issue · 0 comments

See #71 for a spec on adding a vector database to Apollo.

Once we have a vector database ready to go, we need to work out how to encode the docs site into it. Services like chat and the job generator will then use it to add really focused context to prompts.

I think the process is something like this:

  • Clone the docs repo and build it
    • This is quite a computationally expensive step - but we do need to do it to get a nice clean markdown representation of all our docs. Would it be easier to scrape the HTML site at docs.openfn.org instead? I don't think so?
  • Pull all the parsed .md files into strings
  • Break each .md file up into chunks by section. I think a section runs from one ## heading to the next ## heading, or to the end of the document
  • I don't know if we need to encode any context into the section, like a path?
  • Embed each section into the database.
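
To make the chunking step concrete, here is a rough Python sketch of splitting one markdown string into sections at each ## heading. It is only an illustration under some assumptions: the names `Chunk` and `split_sections` are hypothetical, the source path is attached to every chunk as one possible answer to the "context" question above, and ## lines inside fenced code blocks are skipped so that shell comments and similar lines don't start a new section. Any text before the first ## heading is kept as its own chunk with an empty heading.

```python
import re
from dataclasses import dataclass

# Matches a level-2 heading line such as "## Usage" (but not "#" or "###").
H2 = re.compile(r"^## +(.+?)\s*$")
# Matches the start of a fenced code block delimiter (simplified: no indentation).
FENCE = re.compile(r"^(```|~~~)")


@dataclass
class Chunk:
    path: str     # hypothetical: source file path, kept as context for the chunk
    heading: str  # text of the "##" heading, or "" for text before the first one
    text: str     # section body, including its heading line


def split_sections(markdown: str, path: str) -> list[Chunk]:
    """Split a markdown document into chunks bounded by "##" headings."""
    chunks: list[Chunk] = []
    heading = ""
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        # Store the section collected so far, skipping empty ones.
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk(path, heading, body))

    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        # Ignore "##" lines inside code fences (e.g. shell comments).
        match = None if in_fence else H2.match(line)
        if match:
            flush()
            heading = match.group(1)
            lines = [line]
        else:
            lines.append(line)
    flush()
    return chunks
```

Each `Chunk` could then be passed to whatever embedding call the vector database from #71 ends up exposing; that part is deliberately left out here since the database isn't chosen in this issue.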

The process is likely to be split into several distinct commands: build the doc site, extract the content chunks, and embed the content chunks.

This process all needs to run at build-time, when the Docker image is assembled, so that the database is nicely pre-seeded when it gets deployed.