webarchiving
There are 46 repositories under the webarchiving topic.
iipc/awesome-web-archiving
An Awesome List for getting started with web archiving
akamhy/waybackpy
Wayback Machine API interface & a command-line tool
harvard-lil/warc-gpt
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
N0taN3rd/Squidwarc
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium with or without a head
ArchiveTeam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
N0taN3rd/node-warc
Parse And Create Web ARChive (WARC) files with node.js
machawk1/awesome-memento
A list of things related to software, literature, and other content for 🕣 Memento
peterk/warcworker
A dockerized, queued high fidelity web archiver based on Squidwarc
commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
cipher387/quickcacheandarchivesearch
Quick Cache and Archive search buttons
datacoon/metawarc
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
peterk/munin-indexer
A social media open post web archiving tool
httpreserve/httpreserve
Digital Preservation of HTTP in documentary heritage.
ArchiveTeam/WebArchiver
Decentralized web archiving
ruarxive/awesome-digital-preservation
Awesome list dedicated to digital and data preservation tools, sources, services and so on.
WebarchivCZ/Seeder
Seeder - Czech webarchive curating tool and public site
basenana/nanafs
🗄 File-Based Reference Filing System.
natliblux/warc-safe
A tool for detecting viruses and NSFW material in WARC files
toimik/WarcProtocol
Parser for WARC (aka WebArchive) files
httpreserve/tikalinkextract
Tika based link (URL) extractor for httpreserve
atomotic/pywb-recorder-tor
pywb recorder over tor, anonymously records the web. (docker image)
oduwsdl/tmvis
An archival thumbnail visualization server
News-Archiver/news-archiver
News Archiver, Data Aggregation for CNN and Fox News
atomotic/webrecorder-chrome-extension
Record the currently active tab on webrecorder.io
httpreserve/linkscanner
A helper package to tokenize textual content and retrieve hyperlinks
ArchivingToolsForWBM/AdvancedInternetArchiving
Makes saving pages in bulk to the wayback machine much easier
exponential-decay/moonshine
Given four bytes, download a random file from web archives implementing the UKWA Shine interface
httpreserve/workbench
Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB
ibnesayeed/awesome-web-archiving
An Awesome List for getting started with web archiving
mgunn001/WebArchiving-SeminarCourse
Class page for ODU CS 791 / 891 Web Archiving Seminar
MozillaCZ/phpbbcrawler
Link crawler for a phpBB forum
TarekJor/wpull
Wget-compatible web downloader and crawler.
athenekilta/arkisto
Digital archive of web pages related to the Guild of Information Networks
ibnesayeed/archival-tests
A set of web archival replay test cases
mijho/crawl-log2xml
Parse a Heritrix crawl.log into an XML sitemap