scrapy-poet

Warning

Current status is "experimental".

scrapy-poet implements the Page Object pattern for Scrapy.

License is BSD 3-clause.

Installation

pip install scrapy-poet

scrapy-poet requires Python 3.6+ and Scrapy 2.1.0+.

Usage

First, enable the middleware in your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_poet.InjectionMiddleware': 543,
}
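
The value 543 is the middleware's order; Scrapy sorts downloader middlewares by this number. If your project already defines DOWNLOADER_MIDDLEWARES, add the scrapy-poet entry to the existing dict instead of replacing it. A minimal sketch, where `myproject.middlewares.CustomProxyMiddleware` is a hypothetical placeholder for a middleware of your own:

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical project middleware, shown only to illustrate merging.
    'myproject.middlewares.CustomProxyMiddleware': 350,
    # scrapy-poet's injection middleware, as in the snippet above.
    'scrapy_poet.InjectionMiddleware': 543,
}
```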

After that, you can write spiders that use the page object pattern to separate extraction code from the spider:

import scrapy
from web_poet.pages import WebPage


class BookPage(WebPage):

    def to_item(self):
        return {
            'url': self.url,
            'name': self.css("title::text").get(),
        }


class BooksSpider(scrapy.Spider):

    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        links = response.css('.image_container a')
        yield from response.follow_all(links, self.parse_book)

    def parse_book(self, response, book_page: BookPage):
        # book_page is built and injected based on the BookPage annotation
        yield book_page.to_item()
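
To make the idea behind the `book_page: BookPage` argument more concrete, here is a rough, standard-library-only sketch of annotation-based injection. It is not scrapy-poet's actual implementation: `FakeResponse`, `PageObject`, and `call_with_page_objects` are hypothetical names invented for this illustration, and the title is extracted with a regular expression rather than CSS selectors.

```python
import inspect
import re
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for a downloaded response (hypothetical, not a Scrapy class)."""
    url: str
    text: str


class PageObject:
    """Hypothetical base class: wraps a response and holds extraction logic."""

    def __init__(self, response):
        self.response = response


class BookPage(PageObject):

    def to_item(self):
        # Extraction code lives in the page object, not in the callback.
        match = re.search(r"<title>(.*?)</title>", self.response.text, re.S)
        return {
            'url': self.response.url,
            'name': match.group(1).strip() if match else None,
        }


def call_with_page_objects(callback, response):
    """Build each PageObject the callback asks for via annotations, then call it."""
    kwargs = {}
    for name, param in inspect.signature(callback).parameters.items():
        annotation = param.annotation
        if inspect.isclass(annotation) and issubclass(annotation, PageObject):
            kwargs[name] = annotation(response)
    return callback(response, **kwargs)


def parse_book(response, book_page: BookPage):
    return book_page.to_item()


response = FakeResponse(
    url='http://books.toscrape.com/',
    text='<html><head><title> All products </title></head></html>',
)
item = call_with_page_objects(parse_book, response)
print(item)
```

Because the extraction logic sits in `BookPage`, it can be exercised with a hand-built response like the one above, without running a crawl.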

TODO: document the motivation and the rest of the features, provide more usage examples, explain shortcuts, etc. For now, please check the spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders

Contributing

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.