Unhandled browser crash event
NiuBlibing opened this issue · 4 comments
When Chrome is killed or crashes, the handler keeps trying to create new pages in the browser context and throws this exception:
2023-01-31 19:29:51 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.baidu.com>
Traceback (most recent call last):
File "/home/test/source/test/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
result = current_context.run(
File "/home/test/source/test/venv/lib/python3.10/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/test/source/test/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1030, in adapt
extracted = result.result()
File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 261, in _download_request
page = await self._create_page(request)
File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 187, in _create_page
context = await self._create_browser_context(
File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 163, in _create_browser_context
context = await self.browser.new_context(**context_kwargs)
File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 13847, in new_context
await self._impl_obj.new_context(
File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_browser.py", line 127, in new_context
channel = await self._channel.send("newContext", params)
File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 44, in send
return await self._connection.wrap_api_call(
File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 419, in wrap_api_call
return await cb()
File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 79, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
I use this code:
import os
from signal import SIGKILL

import psutil
import scrapy


class Debug1Spider(scrapy.Spider):
    name = "debug1"
    allowed_domains = []
    custom_settings = {
        "PLAYWRIGHT_CONTEXTS": {
            "default": {
                "ignore_https_errors": True,
            }
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://www.httpbin.org/get",
            meta={"playwright": True, "playwright_include_page": False},
            callback=self.parse,
        )
        yield scrapy.Request(
            "https://www.httpbin.org/",
            meta={"playwright": True, "playwright_include_page": False},
            callback=self.parse,
        )
        # Simulate a browser crash by force-killing every running chrome process.
        for proc in psutil.process_iter(["pid", "name"]):
            if proc.info["name"] == "chrome":
                os.kill(proc.info["pid"], SIGKILL)

    async def parse(self, response):
        print("request: {}".format(response.request.url))
It seems the handler needs to deal with the browser close event. From the Playwright docs:
on("disconnected")
Emitted when Browser gets disconnected from the browser application. This might happen because of one of the following:
- Browser application is closed or crashed.
- The browser.close() method was called.
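For reference, outside of scrapy-playwright this is roughly how such a listener is attached with Playwright's Python API; a minimal standalone sketch, separate from the proposed patch below:

```python
import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        # The "disconnected" listener receives the Browser instance; it fires
        # both when close() is called and when the browser process crashes.
        browser.on("disconnected", lambda _browser: print("browser disconnected"))
        await browser.close()


asyncio.run(main())
```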
Is this patch ok?
---
scrapy_playwright/handler.py | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/scrapy_playwright/handler.py b/scrapy_playwright/handler.py
index 36c96cd..428f71d 100644
--- a/scrapy_playwright/handler.py
+++ b/scrapy_playwright/handler.py
@@ -132,6 +132,7 @@ class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):
if not hasattr(self, "browser"):
logger.info("Launching browser %s", self.browser_type.name)
self.browser: Browser = await self.browser_type.launch(**self.launch_options)
+ self.browser.on("disconnected", self.__make_close_browser_callback())
logger.info("Browser %s launched", self.browser_type.name)
async def _create_browser_context(
@@ -447,6 +448,12 @@ class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):
return close_browser_context_callback
+ def __make_close_browser_callback(self) -> Callable:
+ def close_browser_call(browser: Browser) -> None:
+ logger.debug("Browser closed")
+ del self.browser
+ return close_browser_call
+
def _make_request_handler(
self,
context_name: str,
--
2.39.1
Is this patch ok?
Looks like a good start, however it might be necessary to also close the contexts like in https://github.com/scrapy-plugins/scrapy-playwright/blob/v0.0.26/scrapy_playwright/handler.py#L260-L261. This needs some research; I guess contexts might be implicitly closed because of the browser crash, but in any case I'd like to make sure all related contexts are closed and removed from the context_wrappers dict (another detail to consider is that persistent contexts are not tied to the browser instance, and there doesn't seem to be a way to listen to the disconnected event on the context level).
This all assumes we would like the crawl to continue if the browser crashes: deleting the browser attribute would cause any subsequent request to try to launch a new one. I'm actually more inclined to closing the engine and stopping everything if the browser crashes, but I'm willing to be proven wrong.
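To make that concrete, the callback from the patch could be extended along these lines. This is a hypothetical sketch for handler.py (the _make_browser_disconnected_callback name is made up here), assuming the handler tracks its contexts in a context_wrappers dict as mentioned above; it is not a tested implementation and does not address persistent contexts:

```python
# (method of ScrapyPlaywrightDownloadHandler, alongside the patch above)
def _make_browser_disconnected_callback(self) -> Callable:
    def browser_disconnected_callback(browser: Browser) -> None:
        logger.debug("Browser disconnected")
        # The contexts that belonged to the crashed browser are most likely
        # already gone, so closing them explicitly could fail; at minimum,
        # stop tracking them so new requests don't try to reuse them.
        self.context_wrappers.clear()
        if hasattr(self, "browser"):
            del self.browser

    return browser_disconnected_callback
```

Whether this is the right behavior still depends on the question above of continuing the crawl versus stopping the engine after a crash.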
There is another problem: when the driver crashes, it may not trigger the crash event, and the pending page.goto call blocks forever and never times out (a defensive workaround is sketched after the trace below):
/usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crPage.js:378
this._firstNonInitialNavigationCommittedReject(new Error('Page closed'));
^
Error: Page closed
at CRSession.<anonymous> (/usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crPage.js:378:54)
at Object.onceWrapper (node:events:627:28)
at CRSession.emit (node:events:525:35)
at /usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crConnection.js:211:39
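As a defensive workaround for that hang, independent of the disconnected-event patch above, the navigation can be wrapped in an outer asyncio deadline so the request fails instead of blocking forever. This is only a sketch of the idea; goto_with_deadline and the 60-second value are illustrative, not part of scrapy-playwright or Playwright:

```python
import asyncio

from playwright.async_api import Page


async def goto_with_deadline(page: Page, url: str, deadline: float = 60.0):
    # page.goto() can block indefinitely if the driver dies without emitting
    # a crash/disconnected event, so enforce an outer deadline around it.
    return await asyncio.wait_for(page.goto(url), timeout=deadline)
```

asyncio.wait_for raises asyncio.TimeoutError (and cancels the pending navigation) once the deadline passes, which at least turns the silent hang into a visible failure that Scrapy can report.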