web-archiving-scripts

A collection of scripts to help with various web-archiving tasks.

archived scripts

Contains various scripts for ad-hoc tasks that may or may not be repeated in the future.

browsertrix-crawler files

Contains scripts relating to browsertrix-crawler.

downloading items from the Internet Archive

Contains a script that reformats the JSON response from the Internet Archive's CDX API and provides better duplicate removal. Outputs to a .txt file.
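As a rough illustration of the reformatting step, the sketch below assumes the CDX API was queried with `output=json`, which returns a list of rows whose first row holds the field names. The function names `cdx_to_wayback_urls` and `write_urls`, the choice to de-duplicate on the `digest` field, and the Wayback URL output format are assumptions made for this example and may differ from what the repository's script does.

```python
import json


def cdx_to_wayback_urls(cdx_json: str) -> list[str]:
    """Turn a CDX JSON response into de-duplicated Wayback Machine URLs."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows  # the first row holds the field names
    seen_digests = set()
    urls = []
    for record in records:
        item = dict(zip(header, record))
        # Assumption: captures sharing a content digest count as duplicates.
        if item["digest"] in seen_digests:
            continue
        seen_digests.add(item["digest"])
        urls.append(
            f"https://web.archive.org/web/{item['timestamp']}/{item['original']}"
        )
    return urls


def write_urls(urls: list[str], path: str) -> None:
    """Write one URL per line to a plain-text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

De-duplicating on the digest keeps the first capture of each distinct payload, so repeated captures of unchanged content are dropped while changed content at the same URL is kept.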

pdf decrypt

Contains a script to decrypt a folder of PDFs using pikepdf.
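A minimal sketch of how a folder of PDFs might be decrypted with pikepdf is shown below. The function names, the `_decrypted` filename suffix, and the empty default password are assumptions for illustration, not details taken from the repository's script.

```python
from pathlib import Path


def output_path(pdf_path: Path, out_dir: Path) -> Path:
    """Return where the decrypted copy of pdf_path will be written."""
    return out_dir / f"{pdf_path.stem}_decrypted.pdf"


def decrypt_folder(in_dir: str, out_dir: str, password: str = "") -> list[Path]:
    """Save an unencrypted copy of every PDF in in_dir into out_dir."""
    # Imported here so output_path can be used without pikepdf installed.
    import pikepdf  # third-party: pip install pikepdf

    source = Path(in_dir)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(source.glob("*.pdf")):
        # An empty password is enough when only an owner password is set.
        with pikepdf.open(pdf_path, password=password) as pdf:
            destination = output_path(pdf_path, target)
            # Saving without encryption settings drops the existing encryption.
            pdf.save(destination)
            written.append(destination)
    return written
```

Writing to a separate output folder leaves the original encrypted files untouched, which is usually preferable in a preservation workflow.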

pypreservica scripts

Contains various scripts that utilise Preservica's API using pyPreservica.

sitemap tools

Contains two scripts. The first produces a plain list of URLs from an XML sitemap (output to .txt, .html, or the terminal). The second creates an HTML list from a text file.
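The two tasks can be sketched with the standard library alone. This example assumes the sitemap uses the standard sitemaps.org namespace; the function names `urls_from_sitemap` and `html_list` are hypothetical and the repository's scripts may be structured differently.

```python
import html
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def html_list(urls: list[str]) -> str:
    """Render URLs as an HTML unordered list of links."""
    items = []
    for url in urls:
        safe = html.escape(url, quote=True)  # escape &, <, > and quotes
        items.append(f'  <li><a href="{safe}">{safe}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

For example, `print("\n".join(urls_from_sitemap(xml_text)))` gives the terminal output, and `html_list(urls)` gives the markup for an .html file.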

warc_reader.py

A script that reads a folder of WARC files and cross-references their content with a list of URLs. It also uses Beautiful Soup (BS4) to search the HTML content for specific elements.
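The general approach might look like the sketch below. It assumes the warcio library for reading WARC records, which this README does not name, alongside Beautiful Soup. The function names and the CSS-selector interface are likewise hypothetical.

```python
from pathlib import Path


def cross_reference(wanted: set[str], found: set[str]) -> tuple[set[str], set[str]]:
    """Split the wanted URLs into those present in the WARCs and those missing."""
    return wanted & found, wanted - found


def scan_warcs(folder: str, selector: str) -> dict[str, int]:
    """Map each HTML response URL in the WARCs to its count of selector matches."""
    # Third-party imports kept local so cross_reference works without them.
    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio (assumed)

    matches = {}
    for warc_path in sorted(Path(folder).glob("*.warc*")):  # .warc and .warc.gz
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                headers = record.http_headers
                content_type = headers.get_header("Content-Type", "") if headers else ""
                if "html" not in content_type:
                    continue
                soup = BeautifulSoup(record.content_stream().read(), "html.parser")
                matches[url] = len(soup.select(selector))
    return matches
```

A typical use would be `present, missing = cross_reference(wanted_urls, set(scan_warcs("warcs", "main")))`, where `wanted_urls` is the set of URLs read from the list and `"main"` is an example selector.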

craiglmccarthy/web-archiving-scripts