Pegasus is a highly modular, durable, and scalable crawler for Clojure.
Parallelism is achieved with core.async.
Durability is achieved with durable-queue and LMDB.
A blog post on how Pegasus works: [link]
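To give a feel for the kind of parallelism core.async enables, here is a minimal sketch that is not taken from Pegasus itself. It fans a list of URLs out over a fixed number of workers with `pipeline-blocking`. The namespace, `fake-fetch`, and `fetch-all` are hypothetical names invented for illustration, and the fetch is a stand-in that makes no network request.

```clojure
(ns example.parallel-sketch
  (:require [clojure.core.async :as async]))

;; Hypothetical stand-in for a page fetch; a real crawler would do HTTP here.
(defn fake-fetch
  [url]
  {:url url
   :body (str "<html>" url "</html>")})

(defn fetch-all
  "Applies fake-fetch to every URL using n parallel workers.
  Returns a vector of results in the same order as the input."
  [n urls]
  (let [in  (async/to-chan! urls) ;; core.async 1.2+; older releases call this to-chan
        out (async/chan n)]
    ;; pipeline-blocking runs the transducer on n threads and closes `out`
    ;; once `in` is exhausted.
    (async/pipeline-blocking n out (map fake-fetch) in)
    ;; collect everything from `out` into a vector and block until done
    (async/<!! (async/into [] out))))
```

Calling `(fetch-all 4 ["a" "b" "c"])` returns three maps in the same order as the input, since `pipeline-blocking` preserves ordering.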
Leiningen dependencies:
A few example crawls:
This one crawls 20 docs from my blog.
URLs are extracted using enlive.
(ns your.namespace
  (:require [pegasus.core :refer [crawl]]
            [pegasus.dsl :refer :all])
  (:import (java.io StringReader)))

;; Crawl 20 documents starting from the seed URL.
(defn crawl-sp-blog
  []
  (crawl {:seeds [""]
          :user-agent "Pegasus web crawler"
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"})) ;; store all crawl data in /tmp/sp-blog-corpus/

;; Same crawl, with link extraction defined through the DSL.
(defn crawl-sp-blog-custom-extractor
  []
  (crawl {:seeds [""]
          :user-agent "Pegasus web crawler"
          :extractor (defextractors
                       ;; follow the :href of links at each selector
                       ;; when it matches the regex
                       (extract :at-selector [:article :header :h2 :a]
                                :follow :href
                                :with-regex #"")
                       (extract :at-selector [:ul.pagination :a]
                                :follow :href
                                :with-regex #""))
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))
If you want more control and would rather avoid the DSL, you can use the underlying machinery directly. Here's an example that uses XPaths to extract links.
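(ns your.namespace
  (:require [org.bovinegenius.exploding-fish :as uri]
            [net.cgrand.enlive-html :as html]
            [pegasus.core :refer [crawl]]
            ;; assumed location of the pipeline component protocol
            [pegasus.process :refer [PipelineComponentProtocol]]
            [clj-xpath.core :refer [$x $x:text xml->doc]]))

;; A pipeline component with three methods; the protocol and method
;; names (initialize, run, clean) are assumed from pegasus.process.
(deftype XpathExtractor []
  PipelineComponentProtocol

  (initialize
    [this config]
    config)

  (run
    [this obj config]
    (when (= ""
             (-> obj :url uri/host))
      (let [url (:url obj)
            ;; parse the fetched body as XML; nil if parsing fails
            resource (try (-> obj
                              :body
                              xml->doc)
                          (catch Exception e nil))
            ;; extract the articles
            articles (map
                      :text
                      (try ($x "//item/link" resource)
                           (catch Exception e nil)))]
        ;; add extracted links to the supplied object
        (merge obj
               {:extracted articles}))))

  (clean
    [this config]
    nil))

(defn crawl-sp-blog-xpaths
  []
  (crawl {:seeds [""]
          :user-agent "Pegasus web crawler"
          :extractor (->XpathExtractor)
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))

;; start crawling
(crawl-sp-blog-xpaths)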
Copyright © 2015-2018 Shriphani Palakodety
Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.