Awesome-DataHoarding

Note: This is only a first draft/brainstorm. I will try to organize the list with more useful sections in the future.
Feel free to contribute!

Download utilities
Backup
Compression
Network
File systems
File conversion
Utility Scripts
Content sharing
Data curation
APIs & Online tools
Hardware / Monitoring
Data recovery
Local Media
Long-term data archiving

Download utilities

^ back to top ^

Web Archiving

ArchiveBox: The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Browsertrix Crawler: Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container
Collect: A server to collect & archive websites that also supports video downloads
grab-site: The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Heritrix: Extensible, web-scale, archival-quality web crawler
HTTrack: Download a website from the Internet to a local directory
wail: Web Archiving Integration Layer: One-Click User Instigated Preservation
webrecorder: An integrated platform for creating high-fidelity, ISO-compliant web archives in a user-friendly interface, providing access to archived content, and sharing collections
wikiteam: set of tools for archiving wikis

General

annie: Youtube-DL alternative written in Golang
aria2: A lightweight multi-protocol & multi-source command-line download utility
CrowLeer: Powerful C++ web crawler based on libcurl
curl: Tool and library for transferring data with URL syntax, supporting many protocols
Horahora: Video hosting website and video archival manager for Niconico, Bilibili, and Youtube
httpie: A tool similar to curl and wget but designed to be user-friendly. Useful for web scraping with shell scripts, but be aware that it adds a dependency.
news-crawl: Crawler for news feeds based on StormCrawler that produces WARC files.
Plowshare: Command-line tool to manage file-sharing sites
Rclone: A command line program to sync files and directories to and from various cloud storage providers
rsync: An open source utility that provides fast incremental file transfer
Suck-It: Recursively visit and download a website's content to your disk (multi-threaded)
wget: Utility for non-interactive download of files from the Web.
wget2: Successor of GNU Wget, works multi-threaded
wpull: Wget-compatible web downloader and crawler
you-get: Dumb downloader that scrapes the web
Youtube-DL: A command-line program to download videos from YouTube and a few hundred more sites
ytdl-sub: Automate downloading and metadata generation with YoutubeDL

Application-specific

BBCSoundDownloader: Bulk downloader for BBC's Sound Effects library http://bbcsfx.acropolis.org.uk/
ChanThreadWatch: Saves threads from *chan-style boards and checks for updates until the thread dies
comics-downloader: Command-line tool to download comics and manga in pdf/epub/cbz/cbr from supported sites
floatplane_ripper: Script to rip all videos from https://floatplane.rip/
gallery-dl: Download image galleries and collections from pixiv, exhentai, danbooru and more
Discord-Channel-Scraper: Discord server archival (JSON output, download attachments and emojis)
dzi-dl: Deep Zoom Image Downloader
FanFicFare: Tool for making eBooks from stories on fanfiction and other web sites
FicSave: Online fanfiction downloader
flickr_download: Simple script to download a Flickr set.
Google Images Download: Python script for downloading images
iiif-dl: Command-line tile downloader/assembler for IIIF endpoints/manifests
imgbrd-grabber: Very customizable imageboard/booru downloader with powerful filenaming features.
instaloader: Download pictures (or videos) along with their captions and other metadata from Instagram
InstaLooter: API-less Instagram pictures and videos downloader.
Instagram Scraper: Command-line application written in Python that scrapes and downloads an Instagram user's photos and videos. Use responsibly.
PyInstaLive: Instagram live stream downloader.
RedditDownloader: Scrapes Reddit to download media of your choice
Scribd-Downloader: Allows downloading of Scribd documents
snscrape: A social networking service scraper in Python
RipMe: RipMe is an album ripper for various websites. Runs on your computer. Requires Java 8.
tumblr-utils: Utilities for dealing with Tumblr blogs, Tumblr backup.
yt-mango: Youtube metadata archiver
Youtube-MA: Youtube metadata archiver

Download automation

bazarr: Companion application to Sonarr and Radarr for downloading subtitles
FlexGet: Multipurpose automation tool for content like torrents, nzbs, podcasts, comics, series, movies, etc
Jackett: API support for torrent trackers (works with Sonarr, Radarr and others)
Lidarr: Music collection manager for Usenet and BitTorrent users
Mylar: An automated Comic Book downloader (cbr/cbz) for use with SABnzbd, NZBGet and torrents
Sick-Beard: PVR for newsgroup users (with limited torrent support)
Radarr: A fork of Sonarr to work with movies à la Couchpotato
Sonarr: PVR for Usenet and BitTorrent users

Backup

^ back to top ^

BorgBackup: Deduplicating archiver with compression and encryption

Compression

^ back to top ^

7-Zip: A file archiver with a high compression ratio
KGB Archiver: Compression tool with an unbelievably high compression rate
peazip: File archiver utility
PIGZ: Multi-threaded gzip
WinRAR: Can decompress RAR and zip files.

Network

^ back to top ^

NetLimiter: Internet traffic control and monitoring tool for Windows

File systems

^ back to top ^

httpdirfs: A filesystem which allows you to mount HTTP directory listings
mergerfs: a featureful union filesystem
NTFS drivers for macOS

File conversion

^ back to top ^

AAXtoMP3: Convert AAX files to common MP3, M4A, M4B, FLAC and OGG formats through a basic bash script frontend to FFmpeg
html2warc: Convert web resources to a single WARC file
warcat: Tool and library for handling Web ARChive (WARC) files

Utility Scripts

^ back to top ^

Backblaze B2 sync backup script: Script to sync multiple directories with Backblaze B2
flac2mp3_V0.py: Multi-threaded Python script to convert all FLAC files to MP3 V0 while keeping the directory structure
Misc download scripts: Scripts for downloading content from various websites
TheFrenchGhosty's Ultimate YouTube-DL Scripts Collection: Collection of youtube-dl scripts to aid in YouTube channel archival
rclone_dirsize: Get size of http directory listing with rclone
rm_empty_subdir: Remove empty sub-directories on Windows
void-cat-uploader: This script automatically uploads all files inside a directory to https://void.cat.
youtube-dl_soundcloud: snippet for using youtube-dl to download soundcloud playlists

Content sharing

^ back to top ^

h5ai: HTTP web server index for Apache httpd, lighttpd, nginx and Cherokee
ipfs: Protocol and network designed to create a content-addressable, peer-to-peer method of storing and sharing hypermedia in a distributed file system
opds: Easy to use, Open & Decentralized Content Distribution
Syncthing: An application that lets you synchronize your files across multiple devices

Data curation

^ back to top ^

baobab: Graphical disk usage analyzer
beets: Music library manager and MusicBrainz tagger
browsemonkey: Takes snapshots of file systems for offline browsing and searching.
Calibre: Ebook manager
DataCurator-Filetree: A unified filetree for all kinds of data, which should help in storing, categorising and retrieving
DeepSort: AI powered image tagger backed by DeepDetect
diskover: File system crawler, disk space usage, file search engine and file system analytics powered by Elasticsearch
Everything: Locate files and folders by name instantly (Windows)
FileBot: FileBot is the ultimate tool for organizing and renaming your Movies, TV Shows and Anime
fucking-weeb: A library manager for animu (and TV shows, and whatever).
grepWin: A powerful and fast search tool using regular expressions (Windows)
Hydrus: A desktop application for large media collections
Kiwix: An offline reader for online content like Wikipedia, Project Gutenberg, or TED Talks
jdupes: Powerful duplicate file finder
MediaElch: Media manager for Kodi
MediaInfo: Convenient unified display of the most relevant technical and tag data for video and audio files
Mp3tag: Powerful and easy-to-use tool to edit metadata of audio files (Windows/Mac)
phockup: Media sorting tool to organize photos and videos from your camera
picard: MusicBrainz tagger
TeraCopy: Copy your files faster and more securely
tree: 'tree' command for Linux
WinDirStat: Disk usage statistics viewer and cleanup tool for Windows
WizTree: Finds the files and folders using the most disk space on your hard drive
sist2: Lightning-fast file system indexer and search tool
SyncToy: Microsoft Windows tool for keeping files in sync across locations
VisiPics: Automatically finds duplicated images

APIs & Online tools

^ back to top ^

iqdb: Multi-service reverse image search
thetvdb: TV show metadata (used by Plex)

Hardware / Monitoring

^ back to top ^

CrystalDiskInfo: An HDD/SSD utility that supports some USB, Intel RAID and NVMe drives.
Hard Drive Sentinel: Multi-OS SSD and HDD monitoring and analysis software
smartmontools: Control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into most modern ATA/SATA, SCSI/SAS and NVMe disks

Data recovery

^ back to top ^

PhotoRec: Powerful FOSS data recovery tool with a GUI.
TestDisk: Another FOSS tool by the author of PhotoRec, this one for the command line.

Local Media

^ back to top ^

whipper: Python CD-DA ripper preferring accuracy over speed. Generates .flac, .cue, and .log by default and automatically fetches metadata from MusicBrainz. An EAC log plugin is available.
Exact Audio Copy: A freeware, Windows-only application similar to whipper that doesn't automatically fetch metadata by default, but EAC rips are preferred by most trackers.
MakeMKV: A cross-platform DVD ripper that supports recent Blu-ray discs. It's mostly open source, but the Blu-ray secret sauce is still hidden.
Handbrake: Open source DVD ripper and media transcoder. Has more options and features than MakeMKV, but it cannot rip Blu-ray discs.

Long-term data archiving

^ back to top ^

CommonCrawl: Data collected over seven years (ongoing) which contains web page data, extracted metadata and text extractions.
Blockyarchive: Archive with forward error correction and sector level recoverability
par2cmdline: A PAR 2.0 compatible file verification and repair tool

EpicLPer/awesome-datahoarding
