/hop

A tiny web crawling framework for Elixir

Primary LanguageElixir

Hop

Package Documentation

Hop is a tiny web crawling framework for Elixir.

Installation

Hop is available in Hex:

defp deps do
  [
    {:hop, "~> 0.1"}
  ]
end

Introduction

Hop's goal is to be simple and extensible, while still providing enough guardrails to get up and running quickly. Hop implements a simple, depth-limited, breadth-first crawler - and keeps track of already visited URLs for you. It also provides several utility functions for implementing crawlers with best practices.

You can crawl a webpage with just a few lines of code with Hop:

url
|> Hop.new()
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)

By default, Hop will perform a single-host, breadth-first crawl of all of the pages on the website. The stream/2 execution function will return a stream of {url, response, state} tuples for each successfully visited page in the crawl. This is designed to make it easy to perform caching, extract items from a page, or do whatever other work deemed necessary.

The defaults are designed to get you up and running quickly; however, the power comes in with Hop's simple extensibility. Hop provides extensibility by allowing you to customize your crawler's behavior at 3 different steps:

  1. Prefetch
  2. Fetch
  3. Next

Prefetch

The prefetch stage is a pre-request stage typically used for request validation. By default, during the prefetch stage, Hop will:

  1. Check that the URL has a populated scheme and host

  2. Check that the URL's scheme is acceptable. Default accepted schemes are http and https. This prevents Hop from attempting to visit tel:, email: and other links with non-HTTP schemes.

  3. Check that the host is actually valid, using :inet.gethostbyname

  4. Check that the host is an acceptable host to visit, e.g. not outside of the host the crawl started on.

  5. Perform a HEAD request to check that the content-length does not exceed the maximum specified content length and to check that the content-type matches one of the acceptable mime-types. This is useful for preventing Hop from spending time downloading large files you don't care about.

Most users should stick to using Hop's default pre-request logic. If you want to customize the behavior; however, you can pass your own prefetch/3 function:

url
|> Hop.new()
|> Hop.prefetch(fn url, _state, _opts -> {:ok, url} end)
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)

This simple example performs no validation, and forwards all URLs to the fetch stage.

Your custom prefetch/3 function should be an arity-3 function which takes a URL, the current Crawl state, and the current Hop's configuration options as input and returns {:ok, url} or an error value. Any non-{:ok, url} values will be ignored during fetch.

Fetch

The fetch stage performs the actual requests during your crawl. By default, Hop uses Req and performs a GET request without retries on each URL. In other words, the default fetch function in Hop is:

def fetch(url, state, opts) do
  req_options = opts[:req_options]

  with {:ok, response} <- Req.get(url, req_options) do
    {:ok, response, state}
  end
end

Notice you can customize the Req request options by passing :req_options as a part of your Hop's configuration.

This is simple and acceptable for many cases; however, certain applications require more advanced setups and customization options. For example, you might want to configure your fetch to proxy requests through Puppeteer to render javascript. You can do this easily:

def custom_fetch(url, state, _opts) do
  # url to proxy that renders a page with puppeteer
  proxy_url = "http://localhost:3000/render"

  with {:ok, response} <- Req.post(url, json: %{url: url}) do
    {:ok, response, state}
  end
end

Next

The next function dictates the next links to be crawled during execution. It takes as input the current URL, the response from fetch, the current state, and configuration options. By default, Hop returns all links on the current page as next in the queue to be crawled. You can customize this behaviour by implementing your own next/4 function:

def custom_next(url, response, state, _opts) do
  links =
    url
    |> Hop.fetch_links(response)
    |> Enum.reject(&String.contains?(&1, "wp-uploads"))

  {:ok, links, state}
end

This simple example ignores all URLs that contain wp-uploads. Hop provides a convenience fetch_links/2 to fetch all of the absolute URLs on a webpage. This just uses Floki under-the-hood.

License

TODO