Making the vector DB table metadata fields more consistent
Closed this issue · 11 comments
It's been annoying me for a while that each of the metadata fields for our vector DB are a bit different:
The ones in pgvector_data_doc
use:
tags TEXT[] -- tags associated with the chunk
And pgvector_message
uses:
metadata JSON -- additional metadata of the message
I think this second pattern is the one we should be using, so in effect it would just be an update to ogbujipt.embedding.pgvector_data_doc
.
@choccccy & @olajiide pls create a branch & eventually PR off this ticket for this one. Obvs it will require changes to the test suite, as well.
Investigate Migrations
I think it may be time we start to publish migrations for our DB changes. What that means is that it's some sort of simple script an admin using our PGVector features can run after a DB schema has changed, in order to have semantically the same data, but in the updated schema.
I haven't done a ton of DB admin myself, bu a colleague of mine used to use the Alembic/SQLAlchemy migration facilities for this. SQLAlchemy is an object-relational mapping (ORM), but from what I understand, you don't actually need to use the ORM to use its migration features. I think it works at the raw SQL level.
Anyway, could you guys do some investigation/learning on this topic and see how we might want to approach migration scripts, e.g. from <=0.8.0 to 0.9.0 release of OgbujiPT?
If Alembic doesn't seem suitable, here are other standalone DB migration tools I found after some searching:
Great high level overview: "Get Your Database Under Version Control"
Handy article on low-level SQL migrations (don't think they ever published a pt 2)
We are currently proceeding by just removing anything complicated related to tags
s (search by tags) and assuming that if the user wants that sort of more advanced functionality, that they'll build it themselves by referencing the metadata
On reconsideration, maybe this change will be simple enough (and with a lean enough footprint) that we don't need anything more than a quick Python script to forward-only migrate? I still think we should start to educate ourselves & prepare for how we'll handle this in future.
right now testing and the demo notebook needs fixing
Discussion with @choccccy perhaps sheds a light on why we had this inconsistency in the first place. The more rigid structure (in the data & doc classes) allowed the conjunctive or subjunctive tag selection as part of query. I think the best way to restore this will be a library of callbacks to specialize the query by tags, for example, to match a top-level key of page 1:
page_one_filt = match_any('page', 1, cast='int')
result = await DB.search(text='Hello world!', meta_filter=page_one_filt)
Behind the scenes match_any
would be used by the DataDB
class to update the SQL query in injection-safe fashion. I think it adds a bit of syntactic bitterness in exchange for far more generalization/flexibility. As @choccccy already suggested, this would also eliminate the need for the DocDB
class.
@choccccy I know this is tricky, and I'm happy to implement, if it makes sense. We don't need it straight out of the gates, so you can just simplify as we'd already discussed, and I'll come back and add the meta_filter system via a new PR.
Also, since the branch was created without a direct link to this ticket, I'll mention that https://github.com/OoriData/OgbujiPT/tree/normalize-metadata is meant to close this.
testing sorted as of commit 05db4b5, on to sorting out the jupyter notebook
probably acceptable to call this done enough for 0.9.0 release, and i can finish fixing the notebook soon™
ok! i think that's everything. starting a PR
We probably also need to annotate some sort of version metadata on the tables we create. Should be able to use table comments, though we might also need to accept that if a DBA has chosen table comments of their own, we can't automatically do a version check for them.
Interesting, but probably not quite our use-case (more for a rigorous audit mindset): Bemi
OK, now that we have an example of basic metadata criteria check, I'll close this. As more use cases come to light I can expand the repertoire.