chfoo/warcat

URL-agnostic deduplication of WARC files


This would be useful for grabs where the exact same images are fetched under different URLs. Each duplicate should be stored as a revisit record that points from its URL to the URL of the copy already stored. Duplicates are best discovered by comparing hashes of the content.
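A rough sketch of the idea in plain Python, to make the request concrete. It does not use warcat's API: `Record`, `Decision` and `plan_dedup` are made-up names for illustration, and SHA-1 is only an assumed choice of hash.

```python
import hashlib
from typing import Dict, Iterable, List, NamedTuple, Optional


class Record(NamedTuple):
    """Hypothetical stand-in for a WARC response record."""
    url: str
    payload: bytes


class Decision(NamedTuple):
    """What the deduplicator would write out for one input record."""
    url: str
    kind: str                 # "response" or "revisit"
    refers_to: Optional[str]  # URL of the first copy (revisits only)
    digest: str               # hex digest of the payload


def plan_dedup(records: Iterable[Record]) -> List[Decision]:
    """Mark every record whose payload was already seen as a revisit.

    The URL is ignored when looking for duplicates; only the hash of
    the payload decides whether two records are the same.
    """
    first_seen: Dict[str, str] = {}  # payload digest -> URL of first copy
    plan: List[Decision] = []
    for record in records:
        digest = hashlib.sha1(record.payload).hexdigest()
        original = first_seen.get(digest)
        if original is None:
            # First time this payload appears: keep the full response.
            first_seen[digest] = record.url
            plan.append(Decision(record.url, "response", None, digest))
        else:
            # Same bytes under another URL: point back to the first copy.
            plan.append(Decision(record.url, "revisit", original, digest))
    return plan
```

A real implementation would read the payloads from the WARC and write proper revisit records instead of returning a list.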

This would be used for the Flickr Archive Team project. The WARCs would be post-processed with warcat's deduplication.

edit: better explanation of what this would be used for.