scrapy-plugins/scrapy-playwright

Modify Response object


Thank you for creating such an amazing package! My goal is to retrieve only the intercepted json_data, without the full HTML page content. Is there a way to set the intercepted json_data as the body of the Response object received in the parse callback?

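    import scrapy
    from playwright.async_api import Route
    from scrapy.http import Response
    from scrapy_playwright.page import PageMethod


    # The class wrapper and its names are placeholders added for context.
    class StoresSpider(scrapy.Spider):
        name = "stores"

        def start_requests(self):
            url = "https://littlecaesars.com/en-us/order/pickup/stores/search/75215/"
            yield scrapy.Request(url, callback=self.parse, meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Intercept the API call, then wait until the page has rendered.
                    PageMethod("route", "**/api/GetClosestStores", self.capture_request),
                    PageMethod("wait_for_selector", "//button[contains(text(), 'Start your order')]"),
                ]
            })

        async def capture_request(self, route: Route):
            # Fetch the real API response and hand it back to the page unchanged.
            response = await route.fetch()
            json_data = await response.json()
            await route.fulfill(response=response, json=json_data)

        def parse(self, response: Response):
            # Goal: receive json_data here instead of the full HTML page.
            pass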

(Playwright is required here because the site is protected by reCAPTCHA.)

I'd suggest not setting a custom route; it's usually not a good idea given all the processing done in the handler. I think you can do something similar to this, i.e. intercept the "response" event, extract what you need, store it somewhere temporarily, then retrieve it in the main response callback and return it from there. You might need something like an asyncio.Event if you run into synchronization issues, e.g. if the callback runs before the response event handler (though I'm just thinking out loud, I'm not sure that's even an actual possibility).
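
The idea described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested solution for this site: `JsonCapture` is a hypothetical helper (not part of scrapy-playwright), and it relies only on the `url` attribute and the awaitable `json()` method of the object passed to a Playwright "response" event handler, so it needs nothing beyond the standard library.

```python
import asyncio
from typing import Any, Optional


class JsonCapture:
    """Hypothetical helper: keeps the JSON body of the first matching response."""

    def __init__(self, url_suffix: str) -> None:
        self.url_suffix = url_suffix
        self.data: Optional[Any] = None
        self._ready = asyncio.Event()

    async def on_response(self, response: Any) -> None:
        # Intended to be attached to the page's "response" event.
        # `response` only needs `.url` and an awaitable `.json()`.
        if self._ready.is_set() or not response.url.endswith(self.url_suffix):
            return
        self.data = await response.json()
        self._ready.set()

    async def wait(self, timeout: float = 10.0) -> Any:
        # Covers the case where the spider callback runs before the
        # event handler has finished; raises on timeout.
        await asyncio.wait_for(self._ready.wait(), timeout)
        return self.data
```

To wire this into the spider, one instance could be created per request and passed both as `meta["playwright_page_event_handlers"] = {"response": capture.on_response}` and as `cb_kwargs={"capture": capture}`. An `async def parse(self, response, capture)` callback could then yield the result of `await capture.wait()` instead of parsing the HTML. The `"**/api/GetClosestStores"` route would no longer be needed; `JsonCapture("/api/GetClosestStores")` would match on the URL suffix instead.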