
A RabbitMQ Scheduler for Scrapy


A RabbitMQ Scheduler for Scrapy Framework.

scrapy-rabbitmq is a tool that lets you queue URLs in RabbitMQ and feed them to Scrapy spiders.

Inspired by and modeled after scrapy-redis.


Installation

Using pip, run the following at your command-line prompt:

pip install scrapy-rabbitmq

Or clone the repo and, inside the scrapy-rabbitmq directory, run:

python setup.py install


Usage

Step 1: In your Scrapy settings, add the following config values:

# Enable the scheduler that stores the request queue in RabbitMQ.
SCHEDULER = "scrapy_rabbitmq.scheduler.Scheduler"

# Don't clean up RabbitMQ queues; this allows crawls to be paused and resumed.
SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_rabbitmq.queue.SpiderQueue'

# RabbitMQ Queue to use to store requests
RABBITMQ_QUEUE_NAME = 'scrapy_queue'

# Provide host and port to RabbitMQ daemon
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}

# Store scraped items in RabbitMQ for post-processing.
ITEM_PIPELINES = {
    'scrapy_rabbitmq.pipelines.RabbitMQPipeline': 1
}
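
Before starting a crawl, it can help to confirm that the Step 1 values actually made it into your settings. The following is a minimal sketch, not part of scrapy-rabbitmq: `STEP1_SETTINGS` and `missing_settings` are hypothetical names, and treating exactly these four names as the ones to check is an assumption based on the list above.

```python
# Hypothetical helper, not part of scrapy-rabbitmq: report which of the
# setting names listed in Step 1 are absent from a settings mapping.
STEP1_SETTINGS = (
    "SCHEDULER",
    "SCHEDULER_QUEUE_CLASS",
    "RABBITMQ_QUEUE_NAME",
    "RABBITMQ_CONNECTION_PARAMETERS",
)


def missing_settings(settings):
    """Return the Step 1 setting names that `settings` (a dict) lacks."""
    return [name for name in STEP1_SETTINGS if name not in settings]


# Example: a settings mapping that forgot the connection parameters.
example = {
    "SCHEDULER": "scrapy_rabbitmq.scheduler.Scheduler",
    "SCHEDULER_QUEUE_CLASS": "scrapy_rabbitmq.queue.SpiderQueue",
    "RABBITMQ_QUEUE_NAME": "scrapy_queue",
}
print(missing_settings(example))  # ['RABBITMQ_CONNECTION_PARAMETERS']
```

An empty list means every name from Step 1 is present; the check says nothing about whether the values themselves are valid.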

Step 2: Add RabbitMQMixin to Spider.

Example: multidomain_spider.py

# Scrapy 1.0 and later expose CrawlSpider from scrapy.spiders instead.
from scrapy.contrib.spiders import CrawlSpider
from scrapy_rabbitmq.spiders import RabbitMQMixin

class MultiDomainSpider(RabbitMQMixin, CrawlSpider):
    name = 'multidomain'

    def parse(self, response):
        # parse all the things
        pass
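
As one illustration of what "parse all the things" might involve, the sketch below pulls link targets out of an HTML string like the message body used in Step 4. It is an assumption-laden example rather than the project's method: `LinkCollector` and `extract_links` are hypothetical names, it relies only on Python 3's standard library, and in a real spider Scrapy's own selectors would usually be the more natural choice.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all anchor hrefs found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


body = '<html>raw html contents<a href="http://twitter.com/roycehaynes">extract url</a></html>'
print(extract_links(body))  # ['http://twitter.com/roycehaynes']
```

Inside a spider, a helper like this could be called with the response's HTML text from `parse`.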

Step 3: Run spider using scrapy client

scrapy runspider multidomain_spider.py

Step 4: Push URLs to RabbitMQ

Example: push_web_page_to_queue.py

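#!/usr/bin/env python
import pika
import settings

# Connect using the host and port from settings.py (assumes the Step 1 values).
connection = pika.BlockingConnection(pika.ConnectionParameters(
    **settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()

# Publish to the default exchange, routed to the queue the spider reads from.
channel.basic_publish(exchange='',
                      routing_key=settings.RABBITMQ_QUEUE_NAME,
                      body='<html>raw html contents<a href="http://twitter.com/roycehaynes">extract url</a></html>')

connection.close()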
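
To queue several pages at once, the publish call can be wrapped in a small loop. This is a sketch under stated assumptions: `publish_urls` and `RecordingChannel` are hypothetical names that are not part of scrapy-rabbitmq or pika, and sending one plain URL string per message is an assumption about what your spider expects to receive. The function only needs an object with pika's `basic_publish(exchange, routing_key, body)` method, so it can be tried without a running broker.

```python
def publish_urls(channel, queue_name, urls):
    """Publish each non-blank URL as one message on the default exchange.

    `channel` is expected to behave like a pika channel, i.e. to offer
    basic_publish(exchange=..., routing_key=..., body=...).
    Returns the number of messages published.
    """
    count = 0
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank entries, e.g. empty lines from a file
        channel.basic_publish(exchange='', routing_key=queue_name, body=url)
        count += 1
    return count


class RecordingChannel:
    """Stand-in for a pika channel that only records what was published."""

    def __init__(self):
        self.published = []

    def basic_publish(self, exchange, routing_key, body):
        self.published.append((exchange, routing_key, body))


channel = RecordingChannel()
sent = publish_urls(channel, 'scrapy_queue',
                    ['http://example.com/a', '', 'http://example.com/b\n'])
print(sent)  # 2
```

With a real broker, you would pass the channel obtained from `pika.BlockingConnection(...).channel()` in place of the recording stand-in.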


Contributing and Forking

See Contributing Guidelines


Releases

See the changelog for release details.

Version   Release Date
0.1.0     2014-11-14
0.1.1     2015-07-02

Copyright & License

Copyright (c) 2015 Royce Haynes - Released under The MIT License.