elacuesta/scrapy-pyppeteer

scrapy-pyppeteer not working with Scrapy==2.4

clippered opened this issue · 4 comments

Hi,

This looks like a promising project. I'm trying it out, but no luck so far.

import scrapy
import scrapy.crawler
import pyppeteer

class TestMeSpider(scrapy.Spider):
    name = "test_me"

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/headers", meta=dict(pyppeteer=True))

    async def parse(self, response, page: pyppeteer.page.Page):
        await page.screenshot(options={"path": "headers.png"})
        await page.close()


process = scrapy.crawler.CrawlerProcess(settings={
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
     "DOWNLOAD_HANDLERS": {
        "http": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
        "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
    },
})
process.crawl(TestMeSpider)
process.start()

Here are the logs, including the Python, Scrapy, and OS versions:

2020-10-20 14:45:57 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-10-20 14:45:57 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 (default, Oct  8 2020, 14:06:32) - [Clang 12.0.0 (clang-1200.0.32.2)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.7-x86_64-i386-64bit
2020-10-20 14:45:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-10-20 14:45:57 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2020-10-20 14:45:57 [scrapy.crawler] INFO: Overridden settings:
{'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
...
2020-10-20 14:45:57 [scrapy-pyppeteer] INFO: _launch_browser_signal_handler
2020-10-20 14:45:57 [asyncio] DEBUG: Using selector: KqueueSelector
2020-10-20 14:45:57 [scrapy-pyppeteer] INFO: _launch_browser_signal_handler
2020-10-20 14:46:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...

It seems to get stuck. Any help would be appreciated.
Thanks!

Hi! Thanks for using this package.

I can reproduce the issue with scrapy==2.4, but it works fine for me with scrapy==2.3. Thanks for the report, I will investigate.

Full code and logs, for reference:
import logging

from scrapy import Spider, Request
from scrapy.crawler import CrawlerProcess
import pyppeteer


logging.getLogger("websockets").setLevel(logging.INFO)
logging.getLogger("pyppeteer").setLevel(logging.INFO)


class TestMeSpider(Spider):
    name = "test_me"

    def start_requests(self):
        yield Request("https://httpbin.org/headers", meta=dict(pyppeteer=True))

    async def parse(self, response, page: pyppeteer.page.Page):
        await page.screenshot(options={"path": "headers.png"})
        await page.close()


if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
            },
            "PYPPETEER_LAUNCH_OPTIONS": {
                "executablePath": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
            },
        }
    )
    process.crawl(TestMeSpider)
    process.start()
2020-10-28 08:49:28 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-28 08:49:28 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 (v3.8.6:db455296be, Sep 23 2020, 13:31:39) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.7-x86_64-i386-64bit
2020-10-28 08:49:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-10-28 08:49:28 [scrapy.crawler] INFO: Overridden settings:
{'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2020-10-28 08:49:28 [scrapy.extensions.telnet] INFO: Telnet Password: 612f2d03b068d7dc
2020-10-28 08:49:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-10-28 08:49:28 [scrapy-pyppeteer] INFO: Browser launch options: {'executablePath': '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'}
2020-10-28 08:49:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-28 08:49:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-28 08:49:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-28 08:49:28 [scrapy.core.engine] INFO: Spider opened
2020-10-28 08:49:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-28 08:49:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:50541/devtools/browser/eb34a4b7-33d1-486f-ab38-4e0451ba50e5
2020-10-28 08:49:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/headers> (referer: None) ['pyppeteer']
2020-10-28 08:49:34 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-28 08:49:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 218,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 841,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 6.006192,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 28, 11, 49, 34, 946792),
 'log_count/DEBUG': 1,
 'log_count/INFO': 11,
 'memusage/max': 47808512,
 'memusage/startup': 47804416,
 'pyppeteer/page_count': 1,
 'pyppeteer/page_count/injected_callback': 1,
 'pyppeteer/request_count': 1,
 'pyppeteer/request_count/navigation': 1,
 'pyppeteer/request_method_count/GET': 1,
 'pyppeteer/response_count': 1,
 'pyppeteer/response_status_count/200': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 10, 28, 11, 49, 28, 940600)}
2020-10-28 08:49:34 [scrapy.core.engine] INFO: Spider closed (finished)
[I:pyppeteer.launcher] terminate chrome process...

It seems this is related to scrapy/scrapy#4855. I've submitted a patch (scrapy/scrapy#4872) which makes the above program work. While the PR is still in progress, you can try the change by installing Scrapy directly from its branch:

pip install --upgrade https://github.com/elacuesta/scrapy/archive/asyncio_get_event_loop.zip

Amazing, I tried it and it works!
Can't wait for the fix to be merged into Scrapy, thanks!

Scrapy 2.4.0 is skipped as of 04b09b3 (tagged v0.0.11). Closing, many thanks for the report @clippered 😄
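For context, one common way to skip a known-broken dependency version is a PEP 440 exclusion specifier in the package metadata. A minimal setup.py sketch of that pattern follows; the version bounds and dependency list here are illustrative assumptions, not the project's actual metadata:

```python
# Illustrative setup.py sketch: the "!=2.4.0" specifier lets pip resolve
# any compatible Scrapy release except the one that hangs with asyncio
# download handlers. Bounds and dependencies are assumptions for the
# sake of the example.
from setuptools import setup

setup(
    name="scrapy-pyppeteer",
    version="0.0.11",
    install_requires=[
        "scrapy>=2.0,!=2.4.0",
        "pyppeteer",
    ],
)
```

With such a specifier in place, `pip install scrapy-pyppeteer` would pick Scrapy 2.4.1 or later (or 2.3.x) but never 2.4.0 itself.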