Importing fails if a WARC file misses some records.
Florents-Tselai opened this issue · 2 comments
Assuming a newly created archive.db
warcdb import archive.db ./tests/apod.warc.gz
fails with a sqlite_utils.db.AlterError: No such column: warcinfo.WARC-Record-ID
If, however, one does it like this, it works fine:
warcdb import archive.db ./tests/google.warc
warcdb import archive.db ./tests/apod.warc.gz
./tests/apod.warc.gz: 807it [00:00, 1249.71it/s]
That is because google.warc is a "complete", ideal WARC file, so the db schema is created appropriately.
The proper way to do this is to ship the package itself with a predefined SQL schema, which would require a new version every time the schema changes. This is fine, but the WARC --> relational transformation as-is is just a personal preference, and I'm not convinced it's good enough to nail down yet.
Then again, we don't want a failed import every time a non-complete WARC record is supplied.
I wonder if a stub warcinfo record could be generated if one isn't found when encountering the first record?
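To make the stub idea concrete, here is a minimal sketch. It assumes records arrive as plain dicts of WARC headers, which may not match how warcdb represents them internally, and the names `make_stub_warcinfo` and `ensure_warcinfo` are hypothetical. The header names themselves come from the WARC format.

```python
import uuid
from datetime import datetime, timezone


def make_stub_warcinfo(filename):
    """Build a placeholder warcinfo header dict (hypothetical helper)."""
    return {
        "WARC-Type": "warcinfo",
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Filename": filename,
        "Content-Type": "application/warc-fields",
        "Content-Length": "0",
    }


def ensure_warcinfo(records, filename):
    """Yield all records, preceded by a stub warcinfo if the first record is not one.

    Assumption: each record is a dict keyed by WARC header name.
    """
    it = iter(records)
    first = next(it, None)
    if first is None:
        return  # empty archive: nothing to import, no stub needed
    if first.get("WARC-Type") != "warcinfo":
        yield make_stub_warcinfo(filename)
    yield first
    yield from it
```

This only guarantees that a warcinfo row exists before anything else is inserted. It does not by itself settle which columns that table should have.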
But I see your point that we probably need to define a schema first? Perhaps we could use the wget schema as canonical? It looks like sqlite-utils lets you create tables without inserting:
https://sqlite-utils.datasette.io/en/stable/python-api.html#explicitly-creating-a-table
Would it be weird to require users to do a warcdb init warc.db
prior to importing records?
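A rough sketch of what such an init step could do, so that the schema no longer depends on which file is imported first. It uses the standard-library `sqlite3` module rather than sqlite-utils to stay self-contained, and the table and column lists in `SCHEMA` are illustrative guesses, not the schema warcdb actually produces. `init_db` and `table_columns` are hypothetical names.

```python
import sqlite3

# Assumption: illustrative column lists only; the canonical schema
# (e.g. one derived from a wget-produced WARC) is still undecided.
SCHEMA = {
    "warcinfo": ["WARC-Record-ID", "WARC-Type", "WARC-Date", "WARC-Filename"],
    "response": ["WARC-Record-ID", "WARC-Warcinfo-ID", "WARC-Target-URI", "WARC-Date"],
}


def init_db(conn):
    """Create every table up front; safe to call more than once."""
    for table, columns in SCHEMA.items():
        cols = ", ".join(f'"{c}" TEXT' for c in columns)
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" '
            f'({cols}, PRIMARY KEY ("WARC-Record-ID"))'
        )
    conn.commit()


def table_columns(conn, table):
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    print(table_columns(conn, "warcinfo"))
```

Because the statements use `CREATE TABLE IF NOT EXISTS`, the import command could also call this implicitly instead of requiring a separate init step from the user.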
It looks like it's pretty new but I wonder if @simonw's https://github.com/simonw/sqlite-migrate could be useful here?