Note: Reddit is no longer supporting the timestamp search function, so searching beyond the initial ~1000 posts is no longer supported. For more information about this frankly insane downgrade, see: https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/
Script for archiving a subreddit. Supports resuming and stopping.
Uses Python 3 and the requests library.
$ ./srarchive.py -h usage: srarchive.py [-h] [-u username] [-p password] [-i id] [-s secret] [-a useragent] [-o filename] [-n] [--pprint fmtstr] [--json] [--resume name/time] [--stop name/time] [--auth-file filename] [--sleep SLEEP] [--force] subreddit Retreives entire content of subreddit. positional arguments: subreddit subreddit to archive optional arguments: -h, --help show this help message and exit -u username reddit username -p password reddit password -i id reddit API ID -s secret reddit API secret -a useragent useragent for bot -o filename output listings to file -n dont output anything --pprint fmtstr pretty print format string --json output as json --resume name/time name or time to resume at --stop name/time name or time to stop at --auth-file filename file with authentication info --sleep SLEEP time to sleep between requests --force dont prompt for file append
The easiest way to get started, is to put all authentication information into a file, e.g., auth.json
, and then use the --auth-file
argument, like so:
$ cat auth.json { "username": "reddit username", "password": "reddit password", "api_secret": "api secret", "api_id": "api ID", "useragent": "user-agent to use" } $ ./srarchive.py --auth-file auth.json funny [ ] will output using "str" [ ] Authenticated ...
It is possible to control the format of the output, by using --json
or --pprint <fmtstr>
. For example:
$ ./srarchive.py --auth-file auth.json --pprint '{title}' funny 2>/dev/null Furnace goes out, as soon as I got it running again, I find where the car has run to. Warmest place in the house and best heating grate. He Knows Yummy... Spilled Methanol fuel attracts Angry Bees. The greatest idea of all time ...
Note that status messages goes to stderr
.
Output is can be controlled in two ways: the format of the data, and where it goes.
Using the --pprint
switch, it is possible to format the output using Python’s format strings.
That is, --pprint fmtstr
means we will be running
fmtstr.format(**entry)
on each entry (i.e. post on the subreddit in question).
For example, to extract the title, we can use
--pprint '{title}'
To extract the URL we can use
--pprint '{url}'
It is, of course possible to write more complex format strings. E.g.
--pprint '{author} got {score} upvotes for "{title}"'
to print both OP, number of upvotes and the title of a post.
In addition to formatted output, it is also possible to simply output an entry as JSON.
This is done by supplying the --json
flag.
By default output goes to stdout
.
It is possible to entirly turn off output (e.g. for the sake of debugging) by using the -n
option.
It is possible to direct output to a file, either by using shell redirection or by using the -o
flag.
The difference between
./srarchive.py --auth-file auth.json -o somefile funny
and
./srarchive.py --auth-file auth.json 2>/dev/null > somefile
Is that, in the case somefile
already exists (for example, is the result of a previous run) the former will
prompt you for whether or not results should be appended to somefile
. That is
$ touch somefile $ ./srarchive.py --auth-file auth.json funny -o somefile [ ] will output using "str" [W] file somefile exists. Append to file? (y/n)
It is possible to skip the prompt by supplying the --force
flag.
Note that each entry is terminated by a newline—the number of lines in somefile
will correspond to the number of posts that have been fetched from the subredddit.
There’s a couple of other options avaliable:
The --sleep t
option will make the script sleep t
seconds between each request.
The default value is 1.
Resumption can be controlled by the --resume v
argument, where v
is either a UNIX timestamp or a fullname
(A fullname is a string of the form t3_base36
data. See description here.)
Stopping early is also possible: Much as with resumption, you can use --stop v
(v
again being a UNIX timestamp or fullname)
to specify when the script should stop.
The script only needs to query /new
, /search
and maybe /about
, so the need for a
full Reddit API wrapper seemed a bit overkill.