This is my attempt at rescuing as much as possible from my favorite subreddits.
The Web Archive also has a running archiving attempt over at /r/DataHoarder, but the Archive Warrior just scrapes HTML, which is pretty much useless for my OSINT-related work.
This tool is built on Reddit's old v1 API: it downloads and stores all the JSON files directly, so that they can be processed later.
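To illustrate the basic idea (this is only a sketch, not the archiver's actual code), the following Go snippet fetches a single page of a subreddit listing from Reddit's public `.json` endpoints and writes the raw response to disk; the User-Agent string and output file name are placeholders made up for the example.

```go
// Sketch: fetch one page of a subreddit listing as raw JSON and store it,
// assuming Reddit's public listing endpoints (e.g. /r/<sub>/new.json).
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	url := "https://www.reddit.com/r/MalwareResearch/new.json?limit=100"

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	// Reddit tends to reject requests without a descriptive User-Agent.
	req.Header.Set("User-Agent", "reddit-archivar-sketch/0.1")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Store the raw JSON as-is so it can be processed later.
	if err := os.WriteFile("MalwareResearch_new_page1.json", body, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("saved", len(body), "bytes")
}
```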
- Each listing (hot/top/new) is limited to 10 pages of 100 results each (1000 results), which means that older threads can only be discovered via keyword search.
- Keyword search is also limited to 1000 results, which means the more specific the keywords, the better the discovery (see the paging sketch below).
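As a rough sketch of how that 1000-result cap plays out (again, not the tool's real implementation), the following Go snippet pages through a subreddit keyword search via `/r/<sub>/search.json`, following the `after` cursor for at most 10 pages of 100 results; the subreddit and keyword are placeholders.

```go
// Sketch: paginate a subreddit keyword search up to the ~1000-result cap,
// assuming Reddit's public search endpoint /r/<sub>/search.json.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type listing struct {
	Data struct {
		After    string            `json:"after"`
		Children []json.RawMessage `json:"children"`
	} `json:"data"`
}

func main() {
	const sub, keyword = "MalwareResearch", "ransomware"
	after, total := "", 0

	for page := 1; page <= 10; page++ { // 10 pages x 100 results = the 1000-result cap
		q := url.Values{
			"q":           {keyword},
			"restrict_sr": {"1"},
			"limit":       {"100"},
		}
		if after != "" {
			q.Set("after", after)
		}
		u := fmt.Sprintf("https://www.reddit.com/r/%s/search.json?%s", sub, q.Encode())

		req, _ := http.NewRequest("GET", u, nil)
		req.Header.Set("User-Agent", "reddit-archivar-sketch/0.1")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}

		var l listing
		if err := json.NewDecoder(resp.Body).Decode(&l); err != nil {
			panic(err)
		}
		resp.Body.Close()

		total += len(l.Data.Children)
		after = l.Data.After
		if after == "" { // no further pages
			break
		}
	}
	fmt.Println("discovered", total, "results for", keyword)
}
```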
The keywords are set inside the keywords.json file, and the subreddit is searched for the given set of keywords.
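The exact structure of keywords.json is whatever this repo ships; purely as an illustration, assuming a flat JSON array of search terms, a minimal file could look like this:

```json
[
  "ransomware",
  "phishing",
  "CVE",
  "malware analysis"
]
```

More specific terms work better here, since each keyword search is capped at 1000 results.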
The script I was/am using to archive the cybersecurity-related subreddits is scrape.sh. It builds the binary and then calls it once per subreddit, passing the subreddit as an argument.
Please make sure to use the correct upper/lowercase spelling of the subreddit's name; otherwise the redirects might break the scraping mechanism.
```sh
go build -o ./build/reddit-archivar ./cmds/reddit-archivar/main.go;
cp keywords.json ./build/keywords.json;
cd ./build && ./reddit-archivar /r/MalwareResearch;
```
These subreddits went private too early, so I couldn't archive them :(
- /r/security
AGPL-3.0