httpdisk

httpdisk is an aggressive disk cache built on top of Faraday. It's primarily used for crawling, and it caches every request, including POSTs and transient errors.

Installation

# install gem
$ gem install httpdisk

# or add to your Gemfile
gem 'httpdisk'

Quick Start

require 'httpdisk'

# create a new Faraday client
faraday = Faraday.new do
  _1.use :httpdisk
end

response = faraday.get('https://google.com') # read from network
response = faraday.get('https://google.com') # read from ~/httpdisk/google.com/...

httpdisk includes a handy command that works like curl:

# cache miss, read from network
$ httpdisk google.com

# cache hit, read from ~/httpdisk/google.com/...
$ httpdisk google.com

# supports many curl flags
$ httpdisk -A test-agent --proxy localhost:8080 --output tmp.html twitter.com

Faraday & httpdisk

Faraday is a popular Ruby HTTP client. Faraday uses a stack of middleware to process each request, similar to the way Rack works deep inside Rails or Sinatra. httpdisk is Faraday middleware - it processes requests to look for cached responses on disk. Faraday's usage page is a good place to learn more about Faraday.
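To make the middleware idea concrete, here is a toy stack in plain Ruby. It does not use Faraday or httpdisk; FakeNetwork and ToyCache are hypothetical names invented for this sketch, and the "cache" is an in-memory Hash rather than files on disk. It only illustrates the shape: a caching layer wraps the next layer and answers from its store when it can.

```ruby
# A toy middleware stack in plain Ruby (no Faraday required).
# Each layer responds to #call(env) and may delegate to the next layer.

# Innermost layer: stands in for the network adapter (hypothetical).
class FakeNetwork
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def call(env)
    @calls += 1
    { status: 200, body: "response for #{env[:url]}" }
  end
end

# Caching layer (hypothetical): returns a stored response when it has one,
# otherwise asks the next layer and stores whatever comes back.
class ToyCache
  def initialize(app, store = {})
    @app = app
    @store = store
  end

  def call(env)
    key = "#{env[:method].to_s.upcase} #{env[:url]}"
    @store[key] ||= @app.call(env)
  end
end

network = FakeNetwork.new
stack = ToyCache.new(network)

first = stack.call({ method: :get, url: "http://example.com/" })  # reaches FakeNetwork
second = stack.call({ method: :get, url: "http://example.com/" }) # served from the store
```

After both calls, network.calls is still 1, because the second request never reaches the inner layer.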

The simplest possible setup for httpdisk looks like this:

faraday = Faraday.new do
  _1.use :httpdisk
end
faraday.get(...)

For serious crawling, you probably want a more robust middleware stack:

faraday = Faraday.new do
  _1.options.timeout = 10 # lower the timeout
  _1.use :cookie_jar # cookie support
  _1.request :url_encoded # auto-encode form bodies
  _1.response :json # auto-decode JSON responses
  _1.response :follow_redirects # follow redirects (should be above httpdisk)
  _1.use :httpdisk
  _1.request :retry # retry failed responses (should be below httpdisk)
end
faraday.get(...)

You may want to experiment with the options for :retry, to retry a broader set of transient errors. See examples.rb for more ideas.

Disk Cache

httpdisk calculates a canonical cache key for each request. The key consists of the HTTP method, URL, sorted query, and sorted body if possible. We use md5(key) to build the path for each file in the cache. Try httpdisk --status to see it in action:

$ httpdisk --status "google.com?q=ruby"
url: "http://google.com/?q=ruby"
status: "miss"
key: "GET http://google.com?q=ruby"
digest: "0e37f96800a55958fa6029283c78f672"
path: "httpdisk/google.com/0e3/7f96800a55958fa6029283c78f672"
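The key and path shown above can be approximated in a few lines of Ruby. This is a rough sketch, not httpdisk's actual implementation: cache_key and cache_path are hypothetical helpers, the body is left out, and the exact normalization httpdisk applies may differ, so the digests will not necessarily match real cache files. The ignore_params argument mimics the option of the same name.

```ruby
require "digest/md5"
require "uri"

# Hypothetical helper: canonical key from the method, URL, and sorted query.
def cache_key(method, url, ignore_params: [])
  uri = URI.parse(url)
  params = URI.decode_www_form(uri.query.to_s)
  params = params.reject { |name, _value| ignore_params.include?(name) }.sort
  key = "#{method.to_s.upcase} #{uri.scheme}://#{uri.host}#{uri.path}"
  key += "?#{URI.encode_www_form(params)}" unless params.empty?
  key
end

# Hypothetical helper: host/first-3-hex-chars/remaining-29-hex-chars.
def cache_path(method, url, **opts)
  digest = Digest::MD5.hexdigest(cache_key(method, url, **opts))
  File.join(URI.parse(url).host, digest[0, 3], digest[3..])
end

key = cache_key(:get, "http://google.com?q=ruby")
path = cache_path(:get, "http://google.com?q=ruby")
```

Because the query is sorted before hashing, ?a=1&b=2 and ?b=2&a=1 land on the same file in this sketch.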

EVERY response will be cached on disk, including POSTs. By default, the cache will be placed at ~/httpdisk and cached responses never expire. Some examples:

faraday.get("http://www.google.com", nil, { "User-Agent": "test-agent" })
faraday.get("http://www.google.com", { "q": "ruby" })
faraday.post("http://httpbin.org/post", "name=hello")

This will populate the cache:

$ cd ~/httpdisk
$ find . -type f
./google.com/5eb/fc70198242876f5e83a67253663e9
./google.com/6d0/52ac9a33d25065fc9f405100f3741
./httpbin.org/88f/7b2bc35cc3759c9905c4de1dbf981

$ gzcat google.com/5eb/fc70198242876f5e83a67253663e9
# GET http://www.google.com
HTTPDISK 200 OK
date: Mon, 19 Apr 2021 18:40:01 GMT
expires: -1
cache-control: private, max-age=0
...

Aggressive Caching

httpdisk caches all responses. POST responses are cached, along with 500 responses and other HTTP errors. HTTP response headers that typically control caching are completely ignored. We also cache many exceptions, such as connection refused, timeouts, and SSL errors. These are returned as responses with HTTP status code 999.

In general, if you make a request it will be cached regardless of the outcome.
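As an illustration of the "exceptions become 999 responses" idea, here is a minimal sketch using only the standard library. The list of error classes and the response_or_999 helper are assumptions made for this example; the set of exceptions httpdisk actually handles, and the shape of its responses, may be different.

```ruby
require "timeout"
require "openssl"
require "socket"

# Errors treated as cacheable failures in this sketch (an assumed list).
TRANSIENT_ERRORS = [
  Errno::ECONNREFUSED,
  Timeout::Error,
  OpenSSL::SSL::SSLError,
  SocketError
].freeze

# Hypothetical helper: run the block and, on a listed error, return a
# synthetic response with status 999 instead of raising.
def response_or_999
  yield
rescue *TRANSIENT_ERRORS => e
  { status: 999, body: "#{e.class}: #{e.message}" }
end

ok = response_or_999 { { status: 200, body: "hello" } }
failed = response_or_999 { raise Errno::ECONNREFUSED, "connect(2)" }
```

Errors outside the list (an ArgumentError, say) still propagate, so programming mistakes are not silently turned into 999s.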

String Encoding

httpdisk will honor the Content-Type from responses. Unfortunately, it is entirely possible to get invalid bodies if the Content-Type doesn't match the bytes. This is a major bummer, so httpdisk provides a utf8: option that forces text response bodies to UTF-8.
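For a sense of what "forcing to UTF-8" involves, here is a sketch of the general technique in plain Ruby. The force_utf8 helper is hypothetical and is not how httpdisk implements utf8:. It tags the bytes with the declared charset, replaces anything invalid, and transcodes to UTF-8.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of body, treating the bytes
# as the declared charset. Unknown or missing charsets fall back to UTF-8.
def force_utf8(body, charset)
  encoding =
    begin
      Encoding.find(charset || "UTF-8")
    rescue ArgumentError
      Encoding::UTF_8
    end
  text = body.dup.force_encoding(encoding)
  text = text.scrub # replace bytes that are invalid in the declared encoding
  text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

latin1 = force_utf8("caf\xE9".b, "ISO-8859-1") # 0xE9 is "é" in ISO-8859-1
broken = force_utf8("bad\xFF".b, "UTF-8")       # 0xFF is never valid UTF-8
```

In the second call the stray byte is swapped for a replacement character, so downstream code can rely on valid_encoding? being true.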

Configuration

httpdisk supports a few options:

  • dir: location for disk cache, defaults to ~/httpdisk
  • expires: when to expire cached requests, default is nil (never expire)
  • force: don't read anything from cache (but still write)
  • force_errors: don't read errors from cache (but still write)
  • ignore_params: array of query params to ignore when calculating cache_key
  • logger: log requests to stderr, or pass your own logger
  • utf8: if true, force text response bodies to valid UTF-8

Pass these in when setting up Faraday:

faraday = Faraday.new do
  _1.use :httpdisk, expires: 7*24*60*60, force: true
end

Command Line

The httpdisk command works like curl and supports some of curl's popular flags. Exit code 1 indicates an HTTP response code >= 400 or a failed request.

$ httpdisk --help
httpdisk [options] [url]
Similar to curl:
    -d, --data        HTTP POST data
    -H, --header      pass custom header(s) to server
    -i, --include     include response headers in the output
    -m, --max-time    maximum time allowed for the transfer
    -o, --output      write to file instead of stdout
    -x, --proxy       use host[:port] as proxy
    -X, --request     HTTP method to use
    --retry           retry request if problems occur
    -s, --silent      silent mode (don't print errors)
    -A, --user-agent  send User-Agent to server
Specific to httpdisk:
    --dir             httpdisk cache directory (defaults to ~/httpdisk)
    --expires         when to expire cached requests (ex: 1h, 2d, 3w)
    --force           don't read anything from cache (but still write)
    --force-errors    don't read errors from cache (but still write)
    --status          show status for a url in the cache
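The --expires examples (1h, 2d, 3w) are shorthand for a number of seconds, the same unit used by the expires: option. Here is a small sketch of how such values could be converted; parse_expires is a hypothetical helper that supports only the three units shown in the help text, and httpdisk's own parser may accept other forms.

```ruby
# Seconds per unit, for the units that appear in the help text above.
UNIT_SECONDS = { "h" => 3600, "d" => 86_400, "w" => 604_800 }.freeze

# Hypothetical helper: convert "1h", "2d", "3w" into seconds.
def parse_expires(text)
  match = /\A(\d+)([hdw])\z/.match(text)
  raise ArgumentError, "invalid duration: #{text.inspect}" unless match

  match[1].to_i * UNIT_SECONDS.fetch(match[2])
end

one_hour = parse_expires("1h")
two_days = parse_expires("2d")
three_weeks = parse_expires("3w")
```

With this mapping, "1w" works out to 7*24*60*60 seconds, the value used in the Configuration example.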

Goodies: httpdisk-grep

The httpdisk-grep command makes it easy to search your cache directory. It can be challenging to use grep/ripgrep because cache files are compressed and JSON bodies often lack newlines. httpdisk-grep is the right tool for the job. See httpdisk-grep --help.

An alternative is to use ripgrep-all with the --rga-accurate flag. Ripgrep-all works well for large caches, though it lacks some of the niceties of httpdisk-grep.
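If you want to poke at the cache from Ruby instead, the core of the problem is just "decompress, then match". Below is a bare-bones sketch using Zlib from the standard library. It assumes cache files are gzip-compressed (as the gzcat example earlier suggests), grep_cache is a hypothetical helper, and it has none of httpdisk-grep's niceties: a JSON body without newlines comes back as one very long line.

```ruby
require "zlib"

# Hypothetical helper: return [path, line] pairs for every line that matches
# pattern in the gzip-compressed files under dir. Non-gzip files are skipped.
def grep_cache(dir, pattern)
  matches = []
  Dir.glob(File.join(dir, "**", "*")).sort.each do |path|
    next unless File.file?(path)

    begin
      Zlib::GzipReader.open(path) do |gz|
        gz.each_line do |line|
          text = line.scrub.chomp # scrub so odd bytes cannot break matching
          matches << [path, text] if text.match?(pattern)
        end
      end
    rescue Zlib::GzipFile::Error
      next # not a gzip file
    end
  end
  matches
end
```

Something like grep_cache(File.expand_path("~/httpdisk"), /ruby/i) would then list each matching line alongside the cache file it came from.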

Limitations & Gotchas

  • Transient errors are cached. This is appropriate for many use cases (like crawling) but can be confusing. Use httpdisk --status to debug.
  • There are no built-in mechanisms to clean up or limit the size of the cache. Use rm.
  • For best results the :follow_redirects middleware should be listed above httpdisk. That way each redirect request will be cached.
  • For best results the :retry middleware should be listed below httpdisk. That way retries will complete before we cache.
  • httpdisk does not work with Faraday's parallel mode or on_complete.
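Since cleanup is left to you, here is one way it might be scripted. This is a sketch, not part of httpdisk: prune_cache is a hypothetical helper, and it assumes a file's modification time is a reasonable proxy for when the response was cached.

```ruby
require "fileutils"

# Hypothetical helper: delete regular files under dir whose mtime is more than
# max_age seconds before now. Returns the list of removed paths.
def prune_cache(dir, max_age, now: Time.now)
  removed = []
  Dir.glob(File.join(dir, "**", "*")).sort.each do |path|
    next unless File.file?(path)
    next if now - File.mtime(path) <= max_age

    FileUtils.rm_f(path)
    removed << path
  end
  removed
end
```

For example, prune_cache(File.expand_path("~/httpdisk"), 30 * 86_400) would remove files untouched for more than 30 days. Empty directories are left in place.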

Changelog

1.0

  • support faraday 2, minimum Ruby is 3.1 now
  • moved to Justfile and Standard

0.5

  • honor Content-Type
  • added :utf8 option to force text-like response bodies to UTF-8

0.4

  • added httpdisk-grep for searching cache files
  • added HTTPDisk::Cache#delete
  • rename :expires_in to :expires

0.3

  • added :ignore_params, for ignoring query params when generating cache keys
  • HTTP 40x & 50x responses return :error status (and respond to force_errors)

0.2 - May 2020

  • added response.env[:httpdisk], which will be true if the response came from the cache
  • added :logger option
  • rake rubocop

0.1 - April 2020

  • Original release