webrecorder/specs

Less options, more recommendations?

Closed this issue · 5 comments

ato commented

WACZ has many different ways of encoding the same information. This means everyone implementing the format needs to pick one when writing and has the burden of supporting all possibilities when reading. Too many options lead to compatibility problems, as different implementations make different choices and likely not every combination is well tested or even supported by everyone.

I would like to suggest:

  1. Recommend a particular pair of archive and index formats which writers should prefer and readers are strongly encouraged to support. There are good reasons for having alternatives here (there are a lot of existing ARC files in the wild, CDX is more widely used, CDXJ has extra functionality) but giving guidance reduces unnecessary divergence.
  2. Pick only one way of encoding the 'pages' objects. I don't see any obvious reason for having three different options here but if there are such reasons then provide guidance as to when each option should be used.

That's a good point, and there's definitely a fine line between being flexible and offering too many options. This is designed to be a bundling/packaging/distribution format rather than a raw data format, so perhaps it should err on the side of having more options?
A goal is to be able to easily zip up existing data, rather than converting to some new format, so definitely at least WARC and ARC need to be supported.

But also a key idea is that with random access, it should be possible to ignore formats that are unrecognized, or directories that are unneeded. Maybe it would be helpful to better codify the exact use cases. Of course replay has been my main concern, but perhaps other types of analysis would be helpful to include. Or, if that's too broad, maybe it should be aimed primarily at replay...
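As a rough sketch of the "ignore what you don't recognize" idea, a reader could list the zip's entries and touch only those under directories it understands. The directory names and file names below are hypothetical, chosen only for illustration; nothing here is taken from the spec.

```python
import io
import zipfile

# Hypothetical layout: the prefixes this particular reader understands.
KNOWN_PREFIXES = ("archive/", "indexes/")

# Build a small in-memory package containing one directory the reader does not know.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("archive/data.warc", b"WARC/1.0\r\n")
    zf.writestr("indexes/index.cdxj", b"")
    zf.writestr("experimental/notes.bin", b"\x00\x01")


def recognized_entries(zf):
    """Return entry names under known prefixes, in archive order."""
    return [name for name in zf.namelist() if name.startswith(KNOWN_PREFIXES)]


with zipfile.ZipFile(buf) as zf:
    names = recognized_entries(zf)
    # Only the requested entry is read; unrecognized entries are never opened.
    first_bytes = zf.read(names[0])
```

The listing comes from the zip's central directory, so skipping an entry costs nothing beyond seeing its name.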

Re: pages. Yes, there are probably too many options; I was trying to iterate on the best approach.
I started with YAML, but then thought CSV may be more accessible and easier to use, as it's basically tabular data. Do you have a preference?

Another use case which I should add is: metadata overrides.
Let's say you create a large .wacz with some metadata, such as a specific set of pages. But then you want to modify some of the metadata or provide an alternate page list. That's where specifying an alternative webarchive.yaml with overrides alongside an existing .wacz may be useful, as you don't want to re-upload a large file. Later, perhaps you can merge the webarchive.yaml into the archive if you want to replace the old metadata permanently for distribution (I believe zip allows replacing files by appending a new entry), or not. In general, the metadata can change more quickly while the archive + index likely will not, and it would be good to account for that somehow.
Perhaps one level of overrides would be a way to address this (sort of the way that command line arguments conventionally override config file settings in CLI programs).
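A minimal sketch of what "one level of overrides" could mean in practice, assuming the metadata is loaded as a flat dictionary: top-level keys from the override file replace those packaged in the .wacz, and nothing is merged any deeper. The field names and the `apply_overrides` helper are hypothetical.

```python
def apply_overrides(base, overrides):
    """Return a new dict in which top-level keys from `overrides` win."""
    merged = dict(base)       # shallow copy, so the packaged metadata is untouched
    merged.update(overrides)  # one level only: whole values are replaced, not merged
    return merged


# Metadata as it might be stored inside the large .wacz (hypothetical fields).
packaged = {
    "title": "Foofest 2019",
    "description": "Original description",
    "pages": ["https://foofest2019.com/"],
}

# A small side file that supplies only an alternate page list.
override = {
    "pages": [
        "https://foofest2019.com/",
        "https://alice.blogspot.com/2019/foofest-wahoo.html",
    ],
}

effective = apply_overrides(packaged, override)
```

This mirrors the CLI analogy: the override file plays the role of command line arguments and the packaged metadata plays the role of the config file.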

ato commented

A goal is to be able to easily zip up existing data, rather than converting to some new format, so definitely at least WARC and ARC need to be supported.

Yep, I agree this is a good reason to have the option. It would also be not unreasonable for a collection to contain the same content in both formats. I don't think anybody should use ARC for any new collecting though and I don't think all tools should have to support it forever.

Re: pages. I started with YAML, but then thought CSV may be more accessible and easier to use, as it's basically tabular data. Do you have a preference?

I like the relative minimalism of CSV, but that same minimalism tends to mean people don't strictly follow the standard and encounter constant compatibility problems. That's okay for data that naturally doesn't contain certain characters, but page names will often contain commas and double quotes, and I think many people will desire a description field containing newlines. We can rule out TSV: while often a simpler and more compatible choice, it outright disallows newlines.
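To make the compatibility concern concrete, here is a small sketch using Python's standard csv module. The column names and the sample row are made up for illustration. A reader that follows the quoting rules recovers the fields exactly, while a naive split on newlines and commas does not.

```python
import csv
import io

# One header row and one page whose title contains a comma and double quotes,
# and whose description contains a newline (all sample data is hypothetical).
rows = [
    ["url", "title", "description"],
    ["https://foofest2019.com/", 'Foofest 2019, the "official" site', "Line one\nLine two"],
]

# Write with proper quoting; newline="" leaves line endings to the csv module.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# A reader that follows the quoting rules gets the original fields back.
parsed = list(csv.reader(io.StringIO(text, newline="")))

# A naive reader that splits on line breaks and commas gets the wrong shape.
naive = [line.split(",") for line in text.splitlines()]
```

Here `parsed` equals `rows`, while `naive` ends up with an extra row and a different number of fields in the data row.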

YAML has some good extra features for config or document formats (comments, references, optional human-friendly formatting) but is sometimes avoided as a data format due to those same features being unnecessary complexity and the ease of getting tripped up by special cases.

JSON is good for machine reading and writing as it balances simplicity with just enough power and is extremely widely supported. You can also pass it directly to UIs written in JavaScript, and it's already in use by CDXJ and other things like the youtube-dl manifests. It's not so pretty for humans, but not terrible, and there are some minor annoyances caused by its strictness (no trailing commas).

For those reasons personally I'd go with JSON. While YAML is definitely nicer for handwriting, I think most people will be using tools to generate the page lists rather than handwriting them. Those few who do want to handwrite them are the same type of people who wouldn't have any difficulty handwriting JSON. That said I can live with YAML.

Let's say you create a large .wacz with some metadata, such as a specific set of pages. But then you want to modify some of the metadata or provide an alternate page list. That's where specifying an alternative webarchive.yaml with overrides alongside an existing .wacz may be useful, as you don't want to re-upload a large file.

Instead of having one big metadata file, how about a /pagelists/ subdirectory with a file for each list? Each pagelist file contains the list name and description plus the list of entrypoint links:

{"title": "Foofest 2019",
 "description": "Pages relating to Foofest 2019",
 "pages":
  [{"url": "https://foofest2019.com/",
    "date": "2019-06-11T04:56:41Z",
    "title": "Foofest 2019 official website"},
   {"url": "https://alice.blogspot.com/2019/foofest-wahoo.html",
    "date": "2019-06-25T02:00:00Z",
    "title": "Alice's blog post about attending Foofest 2019"}]}

If you edit a list you can replace that one file by appending to the zip. I'm not sure if an additional override mechanism is necessary, but it seems fair that tools could choose to interpret several wacz files or even directories as if you unpacked them over the top of each other, similar to how jar files work with Java's classpath.
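A sketch of the replace-by-appending idea using Python's zipfile module. The entry names and contents are hypothetical. Note the assumption being relied on: CPython's zipfile resolves a by-name lookup to the last entry with that name, and other zip readers may handle duplicate names differently, so a spec relying on this would probably need to say so explicitly.

```python
import io
import warnings
import zipfile

NAME = "pagelists/foofest.json"  # hypothetical path inside the package

# Create the original package in memory (stand-in for a large .wacz).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("archive/data.warc", b"...large payload...")
    zf.writestr(NAME, '{"title": "Foofest 2019", "pages": []}')

size_before = len(archive.getvalue())

# Append a second entry with the same name; zipfile warns about the duplicate.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(archive, "a") as zf:
        zf.writestr(NAME, '{"title": "Foofest 2019 (revised)", "pages": []}')

size_after = len(archive.getvalue())

# Reading by name now yields the appended version; the old entry is still listed.
with zipfile.ZipFile(archive) as zf:
    current = zf.read(NAME).decode("utf-8")
    entries = [info.filename for info in zf.infolist()]
```

The old bytes stay in the file, so the package only grows; permanently dropping the superseded entry would still require rewriting the zip.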

ato commented

Obviously under this model if a page appears in multiple lists it gets duplicated but since it's just a URL, date and title I don't see that's a real problem and it's simpler overall without the id linking. It's also scalable to very large collections with large numbers of lists. I think a curator may sometimes want to use different text when linking to the same page from different lists anyway based on context.

Yes, I am now leaning towards making the format more opinionated, e.g. there should be 'one and only one way to store pages', but multiple ways that tools can convert to that format, be it from CSV, JSON, YAML, etc.
I had initially envisioned making it easier to create the format 'by hand', but that seems unnecessary and error-prone.
Absolutely agree that having only one way to do things will lessen the burden on implementors of tools.

I think we've pretty much adopted this recommendation :)

With the reference py-wacz tool for creating wacz, we're moving to a standardized output.

  1. The index format is a compressed CDXJ, as produced by py-wacz. Uncompressed CDXJ is also allowed. This may need some more refinement.
  2. We've settled on one way to represent pages, via a newline-delimited JSON file pages/pages.jsonl. The main pages.jsonl is required, while additional lists can also be added.
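For readers unfamiliar with newline-delimited JSON, here is a minimal sketch of writing and reading such a page list. The field names (url, date, title) are borrowed from the example earlier in this thread and are an assumption; the actual pages.jsonl layout produced by py-wacz may require other fields or a header line. The `dump_jsonl` and `load_jsonl` helpers are hypothetical.

```python
import json

# Sample records reusing the fields from the pagelist example in this thread.
pages = [
    {"url": "https://foofest2019.com/",
     "date": "2019-06-11T04:56:41Z",
     "title": "Foofest 2019 official website"},
    {"url": "https://alice.blogspot.com/2019/foofest-wahoo.html",
     "date": "2019-06-25T02:00:00Z",
     "title": "Alice's blog post about attending Foofest 2019"},
]


def dump_jsonl(records):
    """Serialize each record as one JSON object per line."""
    # json.dumps escapes newlines inside strings, so a record never spans lines.
    return "".join(json.dumps(record) + "\n" for record in records)


def load_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


text = dump_jsonl(pages)
restored = load_jsonl(text)
```

Because each line is a complete JSON value, a tool can append a page or stream through a very large list without parsing the whole file at once.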

Closing this as I think we've accomplished these goals, thanks @ato!