srarchive (subreddit archiver)

Note: Reddit no longer supports the timestamp search function, so searching beyond the initial ~1000 posts is no longer possible. For more information about this frankly insane downgrade, see: https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/

Script for archiving a subreddit. Supports resuming and stopping.

Uses Python 3 and the requests library.
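
If requests is not already installed, it can usually be installed with pip:

$ pip install requests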

Usage

$ ./srarchive.py -h
usage: srarchive.py [-h] [-u username] [-p password] [-i id] [-s secret]
                    [-a useragent] [-o filename] [-n] [--pprint fmtstr]
                    [--json] [--resume name/time] [--stop name/time]
                    [--auth-file filename] [--sleep SLEEP] [--force]
                    subreddit

Retreives entire content of subreddit.

positional arguments:
  subreddit             subreddit to archive

optional arguments:
  -h, --help            show this help message and exit
  -u username           reddit username
  -p password           reddit password
  -i id                 reddit API ID
  -s secret             reddit API secret
  -a useragent          useragent for bot
  -o filename           output listings to file
  -n                    dont output anything
  --pprint fmtstr       pretty print format string
  --json                output as json
  --resume name/time    name or time to resume at
  --stop name/time      name or time to stop at
  --auth-file filename  file with authentication info
  --sleep SLEEP         time to sleep between requests
  --force               dont prompt for file append

The easiest way to get started is to put all authentication information into a file, e.g. auth.json, and then use the --auth-file argument, like so:

$ cat auth.json
{
    "username": "reddit username",
    "password": "reddit password",
    "api_secret": "api secret",
    "api_id": "api ID",
    "useragent": "user-agent to use"
}
$ ./srarchive.py --auth-file auth.json funny
[ ] will output using "str"
[ ] Authenticated
...

It is possible to control the format of the output by using --json or --pprint <fmtstr>. For example:

$ ./srarchive.py --auth-file auth.json --pprint '{title}' funny 2>/dev/null
Furnace goes out, as soon as I got it running again, I find where the car has run to. Warmest place in the house and best heating grate.
He Knows
Yummy...
Spilled Methanol fuel attracts Angry Bees.
The greatest idea of all time
...

Note that status messages go to stderr.

Output

Output can be controlled in two ways: the format of the data, and where it goes.

Output formatting

Using the --pprint switch, it is possible to format the output using Python’s format strings. That is, --pprint fmtstr means we will be running

fmtstr.format(**entry)

on each entry (i.e. each post in the subreddit in question).

For example, to extract the title, we can use

--pprint '{title}'

To extract the URL we can use

--pprint '{url}'

It is, of course, possible to write more complex format strings, e.g.

--pprint '{author} got {score} upvotes for "{title}"'

to print the OP, the number of upvotes, and the title of a post.
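
Under the hood this is simply Python's str.format applied to each entry. As a rough sketch (the entry dict below is made up for illustration; real entries carry the fields returned by the Reddit API):

# hypothetical entry, mimicking a few fields of a real post
entry = {
    "author": "someuser",
    "score": 1234,
    "title": "He Knows",
}

fmtstr = '{author} got {score} upvotes for "{title}"'
print(fmtstr.format(**entry))
# someuser got 1234 upvotes for "He Knows"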

In addition to formatted output, it is also possible to simply output each entry as JSON. This is done by supplying the --json flag.
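
For example, to dump every post as JSON and then pull out the titles with jq (assuming each entry ends up as one JSON object per line, as described under Output to a file below):

$ ./srarchive.py --auth-file auth.json --json funny 2>/dev/null > funny.json
$ jq -r '.title' funny.json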

Output to a file

By default, output goes to stdout. It is possible to turn off output entirely (e.g. for the sake of debugging) by using the -n option.

It is possible to direct output to a file, either by using shell redirection or by using the -o flag. The difference between

./srarchive.py --auth-file auth.json -o somefile funny

and

./srarchive.py --auth-file auth.json funny 2>/dev/null > somefile

is that, in case somefile already exists (for example, as the result of a previous run), the former will prompt you whether or not the results should be appended to somefile. That is:

$ touch somefile
$ ./srarchive.py --auth-file auth.json funny -o somefile
[ ] will output using "str"
[W] file somefile exists. Append to file? (y/n)

It is possible to skip the prompt by supplying the --force flag.
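
For example:

$ ./srarchive.py --auth-file auth.json --force -o somefile funny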

Note that each entry is terminated by a newline, so the number of lines in somefile will correspond to the number of posts that have been fetched from the subreddit.
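
A quick way to check how many posts have been archived so far is therefore:

$ wc -l somefile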

Other options

There are a couple of other options available:

sleep

The --sleep t option will make the script sleep for t seconds between requests. The default value is 1.
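
For example, to wait three seconds between requests:

$ ./srarchive.py --auth-file auth.json --sleep 3 funny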

resume

Resumption can be controlled with the --resume v argument, where v is either a UNIX timestamp or a fullname. (A fullname is a string of the form t3_ followed by a base-36 ID; see the Reddit API documentation on fullnames for details.)
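
For example (both values below are made up, purely for illustration):

$ ./srarchive.py --auth-file auth.json --resume t3_abc123 funny
$ ./srarchive.py --auth-file auth.json --resume 1514764800 funny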

stop early

Stopping early is also possible: Much as with resumption, you can use --stop v (v again being a UNIX timestamp or fullname) to specify when the script should stop.
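
For example (the timestamp is again just an illustration):

$ ./srarchive.py --auth-file auth.json --stop 1483228800 funny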

Misc

Why don’t you use PRAW?

The script only needs to query /new, /search, and maybe /about, so pulling in a full Reddit API wrapper seemed like overkill.