Florents-Tselai/WarcDB

init & import workflow

Florents-Tselai opened this issue · 3 comments

I feel that the init may be redundant and/or should be done implicitly.
For first-time users, I'd like them to be able to get up and running ASAP.
That is:

warcdb import f1.warc f2.warcz f3.warc.gz ... 

Some initialization step is necessary, however, to ensure that we're in sync with the "current" relational representation of a WARC file. But that representation will undoubtedly change, either drastically (table renames) or incrementally (OLAP-like views being added).

In practical terms, that means that when an import command is issued, the following steps happen:

  • When creating a new archive (the file we're importing into does not exist), just use the latest schema and proceed normally.
  • If the archive file exists, figure out its current version as stored in the DB. If the package version is newer, apply migrations and proceed with the import.
  • If the package version is older than the archive's version, abort and prompt the user to upgrade the package.
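
The three cases above can be sketched as a small decision function. This is only an illustration, not WarcDB's actual code: `Action`, `parse_version` and `plan_import` are hypothetical names, versions are assumed to be plain dotted integers such as `0.2.0`, and the equal-version case (import directly) is an assumption the list does not spell out.

```python
from enum import Enum


class Action(Enum):
    CREATE = "create"    # new archive: use the latest schema
    MIGRATE = "migrate"  # archive is older: apply migrations, then import
    PROCEED = "proceed"  # versions match: import directly (assumed case)
    ABORT = "abort"      # archive is newer: user must upgrade the package


def parse_version(text):
    """Turn '0.2.0' into (0, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def plan_import(archive_version, package_version):
    """Decide what `import` should do; archive_version is None for a new file."""
    if archive_version is None:
        return Action.CREATE
    archive = parse_version(archive_version)
    package = parse_version(package_version)
    if package > archive:
        return Action.MIGRATE
    if package < archive:
        return Action.ABORT
    return Action.PROCEED
```

Comparing parsed tuples rather than raw strings keeps `0.10.0` newer than `0.9.0`, which a plain string comparison would get wrong.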

Notes

  • The current package (application) and schema versions are coupled, and I don't see a reason to change that.
  • IIRC, v0.1.0 stored no version information in the DB, which is a shame. We should therefore assume v0.1.0 when no version is found, and store the version explicitly for v0.2.0 and later.
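
One way the version could be stored and read back, with the v0.1.0 fallback, is sketched below using only the standard `sqlite3` module. The `_warcdb_meta` table, its `key`/`value` columns, and both function names are assumptions made up for this example; the real schema may record the version differently.

```python
import sqlite3

LEGACY_VERSION = "0.1.0"  # assumed for archives that carry no version record


def read_schema_version(conn):
    """Return the stored version, or LEGACY_VERSION if none was ever written."""
    has_table = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = '_warcdb_meta'"
    ).fetchone()
    if has_table is None:
        return LEGACY_VERSION
    row = conn.execute(
        "SELECT value FROM _warcdb_meta WHERE key = 'version'"
    ).fetchone()
    return row[0] if row else LEGACY_VERSION


def write_schema_version(conn, version):
    """Store the version explicitly in a hypothetical key/value table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS _warcdb_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO _warcdb_meta (key, value) VALUES ('version', ?)",
        (version,),
    )
    conn.commit()
```

With this shape, an archive created by v0.1.0 reads back as `0.1.0` without any special handling by the caller, and every later import can overwrite the stored value after migrating.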

edsu commented

I understand wanting to do away with init. It reminded me of git, which I liked. But if you really don't like it, perhaps an implicit migrate could happen whenever you run import?

Florents-Tselai commented

Yes, pretty much what I described above, right?

edsu commented

I think so!