scrapy-plugins/scrapy-playwright

Request Retry on Page Error

Ehsan-U opened this issue · 1 comment

I know Scrapy supports retries based on HTTP status codes, but I ran into a case where the page only loads on a second attempt. The error raised is not an HTTP error but a page-related one, and scrapy-playwright has no way to retry the request in such cases. It would be great to have an option to retry when a particular type of error occurs, e.g. the TimeoutError raised here when the selector is not found.

from typing import Iterable

import scrapy
from scrapy import Request
from scrapy_playwright.page import PageMethod

# inside the spider class
def start_requests(self) -> Iterable[Request]:
    url = "https://www.elliman.com/offices/usa"
    yield scrapy.Request(url, callback=self.parse, meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("wait_for_selector", "//div[contains(@id, 'brokeritem')]", state="attached")
        ],
        "dont_redirect": True,
    })

Sounds like a job for a Downloader Middleware, specifically the process_exception method: returning a new Request from it makes Scrapy schedule that request instead of propagating the exception.

The following snippet is adapted from the example included in https://github.com/scrapy-plugins/scrapy-playwright/blob/v0.0.41/examples/exception_middleware.py:

import logging

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod


class HandleTimeoutMiddleware:
    def process_exception(self, request, exception, spider):
        logging.info(
            "Caught exception: %s for request %s, retrying",
            exception.__class__,
            request,
        )
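        # retry with a plain Playwright request; the page methods that caused the timeout are not re-applied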
        return Request(
            url=request.url,
            meta={"playwright": True},
            dont_filter=True,
        )


class HandleExceptionInMiddlewareSpider(Spider):
    name = "exception"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            HandleTimeoutMiddleware: 100,
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield Request(
            url="https://example.org",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod(
                        "wait_for_selector",
                        "//div[contains(@id, 'asdf')]",
                        timeout=100,
                    )
                ],
            },
        )

    def parse(self, response, **kwargs):
        logging.info("Received response for %s", response.url)
        yield {"url": response.url}

output:

(...)
2024-11-01 09:49:11 [scrapy.core.engine] INFO: Spider opened
2024-11-01 09:49:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-11-01 09:49:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-01 09:49:11 [scrapy-playwright] INFO: Starting download handler
2024-11-01 09:49:16 [scrapy-playwright] INFO: Launching browser chromium
2024-11-01 09:49:16 [scrapy-playwright] INFO: Browser chromium launched
2024-11-01 09:49:16 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-11-01 09:49:16 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-11-01 09:49:16 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-11-01 09:49:17 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://example.org> exc_type=<class 'playwright._impl._errors.TimeoutError'> exc_msg=Page.wait_for_selector: Timeout 100ms exceeded.
Call log:
waiting for locator("//div[contains(@id, 'asdf')]") to be visible
Traceback (most recent call last):
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 431, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 479, in _download_request_with_page
    await self._apply_page_methods(page, request, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 625, in _apply_page_methods
    pm.result = await _maybe_await(method(*pm.args, **pm.kwargs))
  File "/.../scrapy-playwright/scrapy_playwright/_utils.py", line 21, in _maybe_await
    return await obj
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 7994, in wait_for_selector
    await self._impl_obj.wait_for_selector(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_page.py", line 397, in wait_for_selector
    return await self._main_frame.wait_for_selector(**locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 323, in wait_for_selector
    await self._channel.send("waitForSelector", locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.wait_for_selector: Timeout 100ms exceeded.
Call log:
waiting for locator("//div[contains(@id, 'asdf')]") to be visible

2024-11-01 09:49:17 [root] INFO: Caught exception: <class 'playwright._impl._errors.TimeoutError'> for request <GET https://example.org>, retrying
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-11-01 09:49:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None) ['playwright']
2024-11-01 09:49:17 [root] INFO: Received response for https://example.org/
2024-11-01 09:49:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org/>
{'url': 'https://example.org/'}
2024-11-01 09:49:17 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
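If you only want to retry on Playwright's TimeoutError and keep the original page methods on the retried request (as described in the issue), the middleware can check the exception type and re-issue a copy of the failing request. This is a rough sketch, not part of the linked example; the max_retries value and the playwright_timeout_retries meta key are just illustrative names:

import logging

from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from scrapy import Request


class RetryOnPlaywrightTimeoutMiddleware:
    max_retries = 2  # illustrative cap on page-level retries

    def process_exception(self, request: Request, exception, spider):
        # only handle Playwright timeouts, let everything else propagate
        if not isinstance(exception, PlaywrightTimeoutError):
            return None
        retries = request.meta.get("playwright_timeout_retries", 0)
        if retries >= self.max_retries:
            return None  # give up, the failure is handled as usual
        logging.info("Retrying %s after %r (attempt %d)", request, exception, retries + 1)
        # request.replace() keeps the original meta, including playwright_page_methods
        new_request = request.replace(dont_filter=True)
        new_request.meta["playwright_timeout_retries"] = retries + 1
        return new_request

Enable it via DOWNLOADER_MIDDLEWARES just like the middleware above. Since the retried request keeps its wait_for_selector page method, the retry cap prevents it from looping forever on a page where the selector never appears.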