URL-agnostic deduplication of WARC
Arkiver2 commented
This would be useful for grabs where the exact same images are fetched under different URLs. Each duplicate should get a revisit record that points from its URL to the URL of the record that already holds the same content. Duplicates are best discovered by comparing the hashes of the records, regardless of URL.
This would be used for the Flickr Archive Team project. The WARCs would be post-processed with warcat deduplication.
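To make the idea concrete, here is a rough sketch of the hash comparison described above. It is not warcat code and does not read or write WARC files: `payload_digest` and `plan_revisits` are hypothetical helpers, the records are plain `(url, payload)` pairs, and the `sha1:` plus base32 label is only an assumption about how the digest would be written.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<base32>' label for the payload (assumed digest format)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def plan_revisits(records):
    """Map each duplicate URL to the URL of the first record with the same payload.

    `records` is an iterable of (url, payload_bytes) pairs in file order.
    The URL is ignored when deciding whether two records are duplicates.
    """
    first_url_by_digest = {}
    revisits = {}
    for url, payload in records:
        digest = payload_digest(payload)
        # Remember the first URL seen for this digest; later ones are duplicates.
        original = first_url_by_digest.setdefault(digest, url)
        if original != url:
            revisits[url] = original
    return revisits


records = [
    ("http://example.com/a.jpg", b"image-bytes"),
    ("http://example.com/b.jpg", b"image-bytes"),
    ("http://example.com/c.jpg", b"other-bytes"),
]
print(plan_revisits(records))
# → {'http://example.com/b.jpg': 'http://example.com/a.jpg'}
```

Each entry in the result would become one revisit record: the key is the URL whose payload can be dropped, and the value is the URL of the record that keeps the payload.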
edit: better explanation of what this would be used for.