
Page Object pattern for Scrapy


scrapy-po


Warning

Current status is "experimental".

scrapy-po implements the Page Object pattern for Scrapy.

The license is BSD 3-clause.

Installation

pip install scrapy-po

scrapy-po requires Python 3.6+ and Scrapy 2.0.1+.

Usage

First, enable the middleware in your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_po.InjectionMiddleware': 543,
}

After that, you can write spiders that use the Page Object pattern to separate extraction code from the spider:

import scrapy
from scrapy_po import WebPage


class BookPage(WebPage):
    def to_item(self):
        return {
            'url': self.url,
            'name': self.css("title::text").get(),
        }


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for url in response.css('.image_container a::attr(href)').getall():
            yield response.follow(url, self.parse_book)

    def parse_book(self, response, book_page: BookPage):
        yield book_page.to_item()
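
In the spider above, parse_book receives a BookPage instance only because its book_page argument is annotated with that type. As a rough illustration of the idea, and not of how scrapy-po itself is implemented, the sketch below uses only the standard library. SimplePage and build_page_kwargs are hypothetical names invented for this example, the regex-based title extraction stands in for real selectors, and the sample HTML is made up.

```python
import re
from typing import get_type_hints


class SimplePage:
    """Hypothetical stand-in for a page object; this is NOT scrapy_po.WebPage."""

    def __init__(self, url, html):
        self.url = url
        self.html = html


class BookPage(SimplePage):
    def to_item(self):
        # A naive regex keeps the sketch dependency-free;
        # real extraction code would use CSS or XPath selectors.
        match = re.search(r"<title>(.*?)</title>", self.html, re.S)
        return {
            "url": self.url,
            "name": match.group(1).strip() if match else None,
        }


def build_page_kwargs(callback, url, html):
    """Build one page object for each callback argument annotated
    with a SimplePage subclass (hypothetical helper)."""
    hints = get_type_hints(callback)
    hints.pop("return", None)
    return {
        name: cls(url, html)
        for name, cls in hints.items()
        if isinstance(cls, type) and issubclass(cls, SimplePage)
    }


def parse_book(response, book_page: BookPage):
    # The callback only asks for a page object; it contains no extraction code.
    return book_page.to_item()


# Made-up sample input for the sketch.
url = "http://books.toscrape.com/"
html = "<html><head><title> A Sample Book </title></head></html>"

kwargs = build_page_kwargs(parse_book, url, html)
item = parse_book(None, **kwargs)
```

Because the extraction logic lives in BookPage rather than in the spider, it can be exercised with a plain URL and HTML string, without running a crawl.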

TODO: document the motivation and the rest of the features, provide more usage examples, explain shortcuts, etc. For now, please check the spiders in the "example" folder: https://github.com/scrapinghub/scrapy-po/tree/master/example/example/spiders

Contributing

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.