Reliably scrape multiple subreddits and users for multiple file formats.
https://github.com/D3vd/Reddit_Image_Scraper
This version supersedes the original template linked above and adds MANY new features:
- Automatic blacklisting of low-quality images
- Automatic blacklisting of dead links
- User-defined query timeout (how long to wait between queries)
- User-defined API error timeout (this seems to help overall speed)
- User-defined query quantity (how many queries per category, per sub)
- User-defined minimum file size (anything smaller is blacklisted and deleted after downloading)
- De-duplication of downloaded files (the same file is never downloaded twice; see the sketch after this list)
- Puts files into their respective folders
- Logging of progress and of every file downloaded
- Logs the time taken per sub and per category
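As a rough sketch of how the de-duplication and minimum-size features can work (illustrative only, not the project's actual code; the function name and the 40 KB threshold are made up for the example), each downloaded file is hashed with blake3 and dropped if it is too small or has been seen before:

```python
from pathlib import Path
from blake3 import blake3  # third-party package: pip install blake3

MIN_FILE_SIZE = 40_000   # example threshold in bytes (the real value comes from config.ini)
seen_hashes = set()      # content hashes of files already kept

def keep_file(path: Path) -> bool:
    """Return True if the file is new and large enough; otherwise delete it."""
    if path.stat().st_size < MIN_FILE_SIZE:
        path.unlink()    # too small: delete (and conceptually blacklist the source URL)
        return False
    digest = blake3(path.read_bytes()).hexdigest()
    if digest in seen_hashes:
        path.unlink()    # exact duplicate of something already downloaded
        return False
    seen_hashes.add(digest)
    return True
```

Hashing the file contents rather than the URL also catches the same image reposted under a different link.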
And best of all, it's VERY EASY to set up.
Make sure you have these libraries installed before running the program:
- PRAW
- ConfigParser
- Urllib
- Blake3
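PRAW and Blake3 are third-party packages on PyPI, while configparser and urllib ship with Python 3, so installing the requirements should amount to something like:

```
pip install praw blake3
```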
- Run the program once. It will create the files you need to get started.
- Go to https://www.reddit.com/prefs/apps (Reddit's app preferences page).
- Press the 'Create an app' button at the bottom.
- Give your app a name and a description.
- Choose 'script' in the app type section.
- Put the client ID and secret in config.ini (see the sketch after this list).
- Add some subreddits to your subs.txt.
- Run python3 reddit_image_scraper.py.
- Check the ./result directory for your images!
- Check the ./logs folder for history and troubleshooting info from your recent runs.
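For reference, here is roughly what the script does with those credentials. This is an illustrative sketch, not the project's actual code: the `[reddit]` section name, the key names, the user agent string, and the example subreddit are assumptions, so check the generated config.ini for the real ones.

```python
import configparser
import praw

# Read the client ID and secret created in the steps above.
# NOTE: the [reddit] section and key names are assumptions for this example.
config = configparser.ConfigParser()
config.read("config.ini")

reddit = praw.Reddit(
    client_id=config["reddit"]["client_id"],
    client_secret=config["reddit"]["client_secret"],
    user_agent="Reddit_Image_Scraper (by u/your_username)",  # placeholder user agent
)

# Pull a handful of hot submissions from one of the subs listed in subs.txt.
for submission in reddit.subreddit("EarthPorn").hot(limit=5):
    print(submission.title, submission.url)
```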
Warnings and best practices:
- Don't run more than one instance at a time. Your API key will get rate-limited, and both runs may end up even slower.
- DO NOT SHARE your API keys or upload them anywhere public! Don't commit them to GitHub, either! Treat them like a username and password.
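One easy safeguard, assuming you keep your clone under git, is to make sure the credentials file can never be committed (config.ini is the file mentioned above that holds your keys):

```
# .gitignore — keep credentials out of version control
config.ini
```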
Crontab entry, if you like.
Runs once a day at 00:00 server time (UTC only if your server's clock is set to UTC).
00 00 * * * cd /path/to/script/Reddit_Image_Scraper-master && python3 Reddit_image_scraper.py