A collection of scripts to help with various web-archiving tasks.
Contains various scripts for ad-hoc tasks that may or may not be repeated in the future.
Contains scripts relating to browsertrix-crawler.
Contains a script that reformats the JSON response from the Internet Archive's CDX API and provides better duplicate removal. Outputs to a .txt file.
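
As a rough illustration of the idea (not the script itself), the sketch below takes a CDX JSON response, whose first row names the fields, and returns the distinct `original` URLs. The function names and the deduplication rule (ignoring the scheme and a trailing slash) are assumptions made for this example.

```python
import json
from pathlib import Path


def unique_urls_from_cdx(cdx_json: str) -> list[str]:
    """Return distinct 'original' URLs from a CDX JSON response, in first-seen order."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    url_index = header.index("original")

    seen = set()
    urls = []
    for record in records:
        url = record[url_index]
        # Assumed rule: http/https variants and a trailing slash count as duplicates.
        key = url.split("://", 1)[-1].rstrip("/")
        if key not in seen:
            seen.add(key)
            urls.append(url)
    return urls


def write_url_list(urls: list[str], out_path: str) -> None:
    """Write one URL per line to a .txt file."""
    Path(out_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
```

The response text would typically come from a CDX query made with `output=json`; fetching it is left out here to keep the sketch self-contained.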
Contains a script to decrypt a folder of PDFs using pikepdf.
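
A minimal sketch of how a folder of PDFs might be decrypted with pikepdf, assuming every file shares one password (empty by default). The function names and the output-folder layout are hypothetical; pikepdf is imported inside the helper so the module can still be loaded where it is not installed.

```python
from pathlib import Path


def decrypt_pdf(src: Path, dst: Path, password: str = "") -> None:
    """Open one PDF with the given password and save an unencrypted copy."""
    import pikepdf  # third-party dependency, imported lazily

    with pikepdf.open(src, password=password) as pdf:
        # Saving without an encryption argument drops the existing encryption.
        pdf.save(dst)


def decrypt_folder(src_dir, dst_dir, password: str = "") -> list[Path]:
    """Decrypt every .pdf in src_dir into dst_dir and return the new paths."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.pdf")):
        dst = dst_dir / src.name
        decrypt_pdf(src, dst, password)
        written.append(dst)
    return written
```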
Contains various scripts that utilise Preservica's API using pyPreservica.
Contains two scripts. One produces a plain list of URLs from an XML sitemap (outputs to .txt, .html, or the terminal). The other creates an HTML list from a text file input.
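
The two tasks can be sketched with the standard library alone. This is an illustration under assumptions, not the scripts themselves: the function names are invented here, and the sitemap is assumed to use the standard sitemaps.org namespace.

```python
import html
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def html_list(urls: list[str]) -> str:
    """Render URLs as an unordered HTML list of links, escaping special characters."""
    items = "\n".join(
        f'  <li><a href="{html.escape(u, quote=True)}">{html.escape(u)}</a></li>'
        for u in urls
    )
    return f"<ul>\n{items}\n</ul>"
```

For the text-file case, the lines of the file could be read with `Path(...).read_text().splitlines()` and passed to `html_list`.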
A script that reads a folder of WARC files and cross-references their content against a list of URLs. It also uses BeautifulSoup (BS4) to search the HTML content for specific HTML elements.
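
The general approach might look like the sketch below. It assumes the warcio library for reading WARC records (the description above only names BS4), and the function names are hypothetical. Both third-party imports happen inside `scan_warc` so the cross-referencing helper can be used on its own.

```python
def cross_reference(archived_urls, wanted_urls) -> dict:
    """Map each wanted URL to True if it appears among the archived URLs."""
    archived = set(archived_urls)
    return {url: url in archived for url in wanted_urls}


def scan_warc(warc_path: str, tag: str) -> dict:
    """Return {url: count of <tag> elements} for each HTML response in one WARC file."""
    from bs4 import BeautifulSoup  # third-party
    from warcio.archiveiterator import ArchiveIterator  # third-party, assumed here

    results = {}
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type", "")
            if "html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            results[url] = len(soup.find_all(tag))
    return results
```

Running `scan_warc` over each file in the folder and passing the collected URLs to `cross_reference` would show which listed URLs were captured.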