Request Retry on Page Error
Ehsan-U opened this issue · 1 comment
Ehsan-U commented
I know Scrapy supports retries based on HTTP status codes, but I came across a case where the page only loads on a second attempt. The error raised is not an HTTP error but a page-related one, and scrapy-playwright has no way to retry the request in such cases. It would be great to have an option to retry when a particular type of error occurs, like in this case the TimeoutError raised when the selector is not found.
def start_requests(self) -> Iterable[Request]:
    url = "https://www.elliman.com/offices/usa"
    yield scrapy.Request(url, callback=self.parse, meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("wait_for_selector", "//div[contains(@id, 'brokeritem')]", state="attached")
        ],
        "dont_redirect": True,
    })
elacuesta commented
Sounds like a job for a Downloader Middleware, specifically the process_exception method.
The following snippet is adapted from the example included in https://github.com/scrapy-plugins/scrapy-playwright/blob/v0.0.41/examples/exception_middleware.py:
import logging

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod


class HandleTimeoutMiddleware:
    def process_exception(self, request, exception, spider):
        logging.info(
            "Caught exception: %s for request %s, retrying",
            exception.__class__,
            request,
        )
        return Request(
            url=request.url,
            meta={"playwright": True},
            dont_filter=True,
        )


class HandleExceptionInMiddlewareSpider(Spider):
    name = "exception"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            HandleTimeoutMiddleware: 100,
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield Request(
            url="https://example.org",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod(
                        "wait_for_selector",
                        "//div[contains(@id, 'asdf')]",
                        timeout=100,
                    )
                ],
            },
        )

    def parse(self, response, **kwargs):
        logging.info("Received response for %s", response.url)
        yield {"url": response.url}
output:
(...)
2024-11-01 09:49:11 [scrapy.core.engine] INFO: Spider opened
2024-11-01 09:49:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-11-01 09:49:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-01 09:49:11 [scrapy-playwright] INFO: Starting download handler
2024-11-01 09:49:16 [scrapy-playwright] INFO: Launching browser chromium
2024-11-01 09:49:16 [scrapy-playwright] INFO: Browser chromium launched
2024-11-01 09:49:16 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-11-01 09:49:16 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-11-01 09:49:16 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-11-01 09:49:17 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://example.org> exc_type=<class 'playwright._impl._errors.TimeoutError'> exc_msg=Page.wait_for_selector: Timeout 100ms exceeded.
Call log:
waiting for locator("//div[contains(@id, 'asdf')]") to be visible
Traceback (most recent call last):
File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 431, in _download_request_with_retry
return await self._download_request_with_page(request, page, spider)
File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 479, in _download_request_with_page
await self._apply_page_methods(page, request, spider)
File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 625, in _apply_page_methods
pm.result = await _maybe_await(method(*pm.args, **pm.kwargs))
File "/.../scrapy-playwright/scrapy_playwright/_utils.py", line 21, in _maybe_await
return await obj
File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 7994, in wait_for_selector
await self._impl_obj.wait_for_selector(
File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_page.py", line 397, in wait_for_selector
return await self._main_frame.wait_for_selector(**locals_to_params(locals()))
File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 323, in wait_for_selector
await self._channel.send("waitForSelector", locals_to_params(locals()))
File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
return await self._connection.wrap_api_call(
File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.wait_for_selector: Timeout 100ms exceeded.
Call log:
waiting for locator("//div[contains(@id, 'asdf')]") to be visible
2024-11-01 09:49:17 [root] INFO: Caught exception: <class 'playwright._impl._errors.TimeoutError'> for request <GET https://example.org>, retrying
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-11-01 09:49:17 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-11-01 09:49:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None) ['playwright']
2024-11-01 09:49:17 [root] INFO: Received response for https://example.org/
2024-11-01 09:49:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org/>
{'url': 'https://example.org/'}
2024-11-01 09:49:17 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
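Note that the snippet above retries only once and drops the original playwright_page_methods. A possible variation (not part of the linked example) keeps the original request intact and caps the number of attempts by using Scrapy's get_retry_request helper, available since Scrapy 2.5; the middleware name below is made up for illustration:

import logging

from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from scrapy.downloadermiddlewares.retry import get_retry_request


class RetryPlaywrightTimeoutMiddleware:
    """Illustrative sketch: retry requests that failed with a Playwright TimeoutError."""

    def process_exception(self, request, exception, spider):
        if isinstance(exception, PlaywrightTimeoutError):
            logging.info("Playwright timeout for %s, retrying", request)
            # get_retry_request copies the request (meta, callback and page methods
            # included), increments its retry counter and returns None once the limit
            # is exhausted, so a permanently broken page does not loop forever.
            return get_retry_request(request, spider=spider, reason=str(exception))
        return None

With this variant, the RETRY_TIMES setting (or the max_retry_times meta key) controls how many attempts are made before giving up.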