PLAYWRIGHT_RESTART_DISCONNECTED_BROWSER not working on local browser
elacuesta opened this issue · 1 comment
The handler is not allowing enough time for the new browser to launch after a crash.
Sample spider adapted from #167.
```python
# crash.py
import os
from signal import SIGKILL

import psutil
import scrapy


class CrashSpider(scrapy.Spider):
    name = "crash"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})

    def parse(self, response):
        print(f"Response: {response}")
        # Simulate a browser crash by killing the local Chromium process,
        # then schedule another Playwright request against the dead browser.
        for proc in psutil.process_iter(["pid", "name"]):
            if proc.info["name"] == "chrome":
                os.kill(proc.info["pid"], SIGKILL)
        yield scrapy.Request("https://httpbin.org/headers", meta={"playwright": True})
```
```
$ scrapy runspider crash.py
(...)
2024-07-16 14:55:09 [scrapy.core.engine] INFO: Spider opened
2024-07-16 14:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-16 14:55:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-16 14:55:09 [scrapy-playwright] INFO: Starting download handler
2024-07-16 14:55:14 [scrapy-playwright] INFO: Launching browser chromium
2024-07-16 14:55:14 [scrapy-playwright] INFO: Browser chromium launched
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-07-16 14:55:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Response: <200 https://httpbin.org/get>
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: Browser disconnected
2024-07-16 14:55:15 [scrapy.core.scraper] ERROR: Error downloading <GET https://httpbin.org/headers>
Traceback (most recent call last):
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1996, in _inlineCallbacks
result = context.run(
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1248, in adapt
extracted: _SelfResultT | Failure = result.result()
File "/.../scrapy_playwright/handler.py", line 358, in _download_request
page = await self._create_page(request=request, spider=spider)
File "/.../scrapy_playwright/handler.py", line 286, in _create_page
page = await ctx_wrapper.context.new_page()
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 12379, in new_page
return mapping.from_impl(await self._impl_obj.new_page())
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 294, in new_page
return from_channel(await self._channel.send("newPage"))
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
return await self._connection.wrap_api_call(
File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TargetClosedError: BrowserContext.new_page: Target page, context or browser has been closed
Browser logs:
<launching> /home/eugenio/.cache/ms-playwright/chromium-1117/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate,HttpsUpgrades,PaintHolding --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --disable-search-engine-choice-screen --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/tmp/playwright_chromiumdev_profile-XXXXXXTy2tU6 --remote-debugging-pipe --no-startup-window
<launched> pid=59155
[pid=59155][err] [0716/145514.301003:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[pid=59155][err] [0716/145514.301041:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
[pid=59155][err] [0716/145514.308584:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[pid=59155][err] [0716/145514.343012:WARNING:sandbox_linux.cc(436)] InitializeSandbox() called with multiple threads in process gpu-process.
2024-07-16 14:55:15 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
```
```
$ scrapy version -v
Scrapy : 2.11.1
lxml : 5.1.0.0
libxml2 : 2.12.3
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.2
Twisted : 23.10.0
Python : 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
pyOpenSSL : 24.0.0 (OpenSSL 3.2.1 30 Jan 2024)
cryptography : 42.0.5
Platform : Linux-6.5.0-41-generic-x86_64-with-glibc2.35
```
```
$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.39
```
I don't think this can be handled with locking or other synchronization primitives, as a browser crash can happen at any time. Retrying seems like the most sensible approach.
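Roughly what I have in mind, as a sketch rather than actual handler code (`get_context` is a hypothetical coroutine standing in for however the handler obtains a live context; `TargetClosedError` is importable from `playwright.async_api` in recent Playwright versions):

```python
import asyncio

from playwright.async_api import TargetClosedError


async def new_page_with_retry(get_context, attempts=3, delay=1.0):
    # get_context is a hypothetical coroutine returning a (possibly freshly
    # restarted) BrowserContext; the real handler would supply its own.
    for attempt in range(attempts):
        try:
            context = await get_context()
            return await context.new_page()
        except TargetClosedError:
            if attempt == attempts - 1:
                raise
            # The browser crashed or is still relaunching: wait and retry.
            await asyncio.sleep(delay)
```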
Hi @elacuesta, in relation to my issue here: #294
I think the update you made to restart the browser worked for me! However, I have a retry middleware enabled, and I find it odd that when the browser crashes it does come back up, but my retry middleware no longer seems to retry that specific request. I'm not sure why; it could be my middleware, but I'm flagging it just in case. Let me know if you need additional info. Thank you.
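For what it's worth, Scrapy's stock `RetryMiddleware` only retries a fixed list of network exceptions, which does not include Playwright's `TargetClosedError`, so that might be why the request is dropped instead of retried. A rough sketch of the kind of middleware I mean (`PlaywrightCrashRetryMiddleware` is a made-up name, not my actual code):

```python
from playwright.async_api import TargetClosedError
from scrapy.downloadermiddlewares.retry import get_retry_request


class PlaywrightCrashRetryMiddleware:
    """Hypothetical sketch: re-schedule requests that died with the browser."""

    def process_exception(self, request, exception, spider):
        if isinstance(exception, TargetClosedError):
            # get_retry_request returns a fresh copy of the request, or None
            # once RETRY_TIMES is exhausted.
            return get_retry_request(request, spider=spider, reason=exception)
        return None
```

Enabled via `DOWNLOADER_MIDDLEWARES` as usual.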