
Crawly


Overview

Crawly is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archiving.

Requirements

  1. Elixir ~> 1.14
  2. Works on GNU/Linux, Windows, macOS, and BSD.

Quickstart

  1. Create a new project: mix new quickstart --sup

  2. Add Crawly as a dependency:

    # mix.exs
    defp deps do
      [
        {:crawly, "~> 0.14.0"},
        # Floki is used for HTML parsing in the spider below
        {:floki, "~> 0.33.0"}
      ]
    end
  3. Fetch dependencies: $ mix deps.get

  4. Create a spider

     # lib/crawly_example/books_to_scrape.ex
     defmodule BooksToScrape do
       use Crawly.Spider
    
       @impl Crawly.Spider
       def base_url(), do: "https://books.toscrape.com/"
    
       @impl Crawly.Spider
       def init() do
         [start_urls: ["https://books.toscrape.com/"]]
       end
    
       @impl Crawly.Spider
       def parse_item(response) do
         # Parse response body to document
         {:ok, document} = Floki.parse_document(response.body)
    
         # Create items (for pages where items exist)
         items =
           document
           |> Floki.find(".product_pod")
           |> Enum.map(fn x ->
             %{
               title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
               price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
               url: response.request_url
             }
           end)
    
         next_requests =
           document
           |> Floki.find(".next a")
           |> Floki.attribute("href")
           |> Enum.map(fn url ->
             Crawly.Utils.build_absolute_url(url, response.request.url)
             |> Crawly.Utils.request_from_url()
           end)
    
         %{items: items, requests: next_requests}
       end
     end

    New in 0.15.0 (not released yet):

    It's possible to use a generator command to speed up spider creation; it produces a file with all the needed callbacks: mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape
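
    Once the spider module compiles, you can sanity-check the extraction logic from iex before running a full crawl. A minimal sketch (Crawly.fetch/1 fetches a single page using the configured fetcher):

      # Run inside `iex -S mix`
      response = Crawly.fetch("https://books.toscrape.com/")
      # Returns the %{items: ..., requests: ...} map built in parse_item/1
      BooksToScrape.parse_item(response)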

  5. Configure Crawly

    By default, Crawly does not require any configuration, but you will likely want to fine-tune your crawls. In config/config.exs:

     import Config
    
     config :crawly,
       # Stop the spider if fewer than 10 items are scraped within the check interval
       closespider_timeout: 10,
       # Maximum number of parallel requests per domain
       concurrent_requests_per_domain: 8,
       # Stop the spider after 100 items have been scraped
       closespider_itemcount: 100,
    
       middlewares: [
         Crawly.Middlewares.DomainFilter,
         Crawly.Middlewares.UniqueRequest,
         {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
       ],
       pipelines: [
         {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
         {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
         Crawly.Pipelines.JSONEncoder,
         {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
       ]

    New in 0.15.0 (not released yet):

    You can generate an example config with the following command: mix crawly.gen.config

  6. Start the Crawl:

      iex -S mix run -e "Crawly.Engine.start_spider(BooksToScrape)"
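
    The crawl can also be controlled from an interactive session. A minimal sketch (Crawly.Engine.start_spider/1 and Crawly.Engine.stop_spider/1 are the same calls used in the one-liner above):

      # Run inside `iex -S mix`
      Crawly.Engine.start_spider(BooksToScrape)
      # ...once enough data has been collected, stop the spider manually:
      Crawly.Engine.stop_spider(BooksToScrape)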
  7. Results can be seen with:

    $ cat /tmp/BooksToScrape_<timestamp>.jl
    

Running Crawly as a standalone application

It's possible to run Crawly as a standalone application for cases when you just need the data and don't want to install Elixir and all the other dependencies.

Follow these steps to bootstrap it with the help of Docker:

  1. Make a project folder on your filesystem: mkdir standalone_quickstart

  2. Create a spider inside the folder created in step 1, ideally in a subfolder called spiders. For example purposes we will re-use: https://github.com/elixir-crawly/crawly/blob/8926f41df3ddb1a84099543293ec3345b01e2ba5/examples/quickstart/lib/quickstart/books_spider.ex

  3. Create a configuration file (in Erlang configuration file format), for example:

      [{crawly, [
          {closespider_itemcount, 500},
          {closespider_timeout, 20},
          {concurrent_requests_per_domain, 2},
    
          {middlewares, [
                  'Elixir.Crawly.Middlewares.DomainFilter',
                  'Elixir.Crawly.Middlewares.UniqueRequest',
                  'Elixir.Crawly.Middlewares.RobotsTxt',
                  {'Elixir.Crawly.Middlewares.UserAgent', [
                      {user_agents, [
                          <<"Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0">>,
                          <<"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36">>
                          ]
                      }]
                  }
              ]
          },
    
          {pipelines, [
                  {'Elixir.Crawly.Pipelines.Validate', [{fields, [title, price, url]}]},
                  {'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, title}]},
                  'Elixir.Crawly.Pipelines.Experimental.Preview',
                  {'Elixir.Crawly.Pipelines.JSONEncoder'}
              ]
          }]
      }].

    TODO: it would be nice to switch this to a human-readable format, e.g. YAML.

  4. Now it's time to start the Docker container:

      docker run -e "SPIDERS_DIR=/app/spiders" -it -p 4001:4001 -v $(pwd)/spiders:/app/spiders -v $(pwd)/crawly.config:/app/config/crawly.config crawly:latest

  5. Now you can open the management interface at localhost:4001 and manage your spiders from there.

Need more help?

Please use Discussions for all conversations related to the project.

Browser rendering

Crawly can be configured so that all fetched pages are browser-rendered, which can be very useful if you need to extract data from pages with lots of asynchronous elements (for example, parts loaded by AJAX).

You can read more in the Crawly documentation on HexDocs.
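
As one possible setup, the fetcher can be pointed at a Splash rendering service. A minimal sketch, assuming a Splash instance running locally on port 8050 (adjust base_url to your renderer):

    # config/config.exs
    config :crawly,
      fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}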

Simple management UI (New in 0.15.0)

Crawly provides a simple management UI, available by default on localhost:4001.

It allows you to:

  • Start spiders
  • Stop spiders
  • Preview scheduled requests
  • Preview the items extracted so far (the Crawly.Pipelines.Experimental.Preview item pipeline must be added to enable the preview; see the configuration sketch below)
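
Items preview, for example, requires adding the Preview pipeline to the pipelines list. A sketch based on the quickstart configuration above:

    # config/config.exs
    config :crawly,
      pipelines: [
        {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
        {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
        # Makes extracted items available for preview in the management UI
        Crawly.Pipelines.Experimental.Preview,
        Crawly.Pipelines.JSONEncoder,
        {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
      ]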

[Screenshot: Crawly Management UI]

Experimental UI

The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders. Check out the code on GitHub.

Roadmap

To be discussed

Articles

  1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
  2. Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html
  3. Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8
  4. Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
  5. What is web scraping, and why you might want to use it?
  6. Using Elixir and Crawly for price monitoring
  7. Building a Chrome-based fetcher for Crawly

Example projects

  1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
  2. E-commerce websites: https://github.com/oltarasenko/products-advisor
  3. Car shops: https://github.com/oltarasenko/crawly-cars
  4. JavaScript based website (Splash example): https://github.com/oltarasenko/autosites

Contributors

We would gladly accept your contributions!

Documentation

Please find the documentation on HexDocs.

Production usages

Using Crawly in production? Please let us know about your use case!

Copyright and License

Copyright (c) 2019 Oleg Tarasenko

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

How to release:

  1. Update version in mix.exs
  2. Update version in quickstart (README.md, this file)
  3. Commit and create a new tag: git commit && git tag 0.xx.0 && git push origin master --follow-tags
  4. Build docs: mix docs
  5. Publish hex release: mix hex.publish