Importing fails if a WARC file misses some records.

Question


Florents-Tselai opened this issue a year ago · 2 comments

Assuming a newly created archive.db, running
warcdb import archive.db ./tests/apod.warc.gz
fails with sqlite_utils.db.AlterError: No such column: warcinfo.WARC-Record-ID

If, however, google.warc is imported first, it works fine:

warcdb import archive.db ./tests/google.warc
warcdb import archive.db ./tests/apod.warc.gz
./tests/apod.warc.gz: 807it [00:00, 1249.71it/s]

That is because google.warc is a "complete", ideal WARC file, so the db schema is created appropriately on the first import.

Answer 1 · 2023-10-17T07:47:51.000Z

The proper way to do this is to ship the package itself with a predefined SQL schema, which would require a new version every time the schema changes. This is fine, but the WARC --> relational transformation as-is is just a personal preference, and I'm not convinced it's good enough to nail it down.

Then again, we don't want a failed import every time an incomplete WARC file is supplied.

Answer 2 · 2023-10-17T08:47:50.000Z

I wonder if a stub warcinfo record could be generated if one isn't found when encountering the first record?
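A rough sketch of what that could look like, assuming records are available as plain dicts of WARC headers. The function name `ensure_warcinfo_first` and the `stub` marker key are made up for illustration and are not part of warcdb:

```python
import uuid
from typing import Iterable, Iterator


def ensure_warcinfo_first(records: Iterable[dict]) -> Iterator[dict]:
    """Yield records unchanged, prepending a stub warcinfo if the first one is not warcinfo."""
    iterator = iter(records)
    first = next(iterator, None)
    if first is None:
        return  # empty input: nothing to yield
    if first.get("WARC-Type") != "warcinfo":
        # Hypothetical stub: only the fields needed to create the warcinfo table
        yield {
            "WARC-Type": "warcinfo",
            "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
            "stub": True,
        }
    yield first
    yield from iterator


# A file that starts with a response record gets a stub warcinfo in front
records = [{"WARC-Type": "response", "WARC-Record-ID": "<urn:uuid:example>"}]
for record in ensure_warcinfo_first(records):
    print(record["WARC-Type"])
```

Files that already start with a warcinfo record would pass through untouched, so this would only change behaviour for the incomplete case.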

But I see your point that we probably need to define a schema first? Perhaps we could use the wget schema as canonical? It looks like sqlite-utils lets you create tables without inserting:

https://sqlite-utils.datasette.io/en/stable/python-api.html#explicitly-creating-a-table

Would it be weird to require users to do a warcdb init warc.db prior to importing records?

It looks like it's pretty new but I wonder if @simonw's https://github.com/simonw/sqlite-migrate could be useful here?