scrapy-plugins/scrapy-playwright

Inconsistent behavior between scrapy_playwright and playwright when accessing web pages

LoyAngel opened this issue · 3 comments

Hello,

I'm seeing inconsistent behavior between scrapy_playwright and playwright when accessing web pages. Using playwright directly, pages load without any issues, but through scrapy_playwright the site detects an outdated browser version and shows a browser version warning.
I would like to understand the difference between the two approaches that could be causing this behavior. I have provided details of my environment setup and the source code for two separate tests below:

Environment Setup:

Operating System: Ubuntu 11.04
Python version: 3.9.2
Python packages:
playwright==1.42.0
Twisted==22.10.0
Scrapy==2.9.0
scrapy-playwright==0.0.34

I'm using Twitter for testing below.

Source Code for Test 1 (using playwright directly):

import asyncio

from playwright.async_api import async_playwright

async def main():
    url = "https://www.twitter.com"
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        await page.screenshot(path="example.png")
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

Result:
[image: example]

Source Code for Test 2 (using scrapy_playwright):

import scrapy

class PwTestSpider(scrapy.Spider):
    name = "pw_test"

    def start_requests(self):
        # GET request
        url = "https://www.twitter.com"
        request_meta = {
            "playwright": True,
            "playwright_include_page": True,
            "playwright_context_kwargs": {},
            "playwright_page_goto_kwargs": {"wait_until": "commit"},
            "handle_httpstatus_all": True
        }
        yield scrapy.Request(url, meta=request_meta, dont_filter=True)

    async def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        page = response.meta["playwright_page"]
        await page.screenshot(path="screenshot.png")
        # Pages obtained via "playwright_include_page" should be closed explicitly
        await page.close()
        return {"url": response.url}

Result:
[image: screenshot]

I have compared the two test cases and cannot identify any significant differences that could explain this inconsistency. Therefore, I would appreciate any insights or guidance on why this discrepancy is occurring.

Thank you for any help!

By default, requests go out with Scrapy's user agent instead of the browser's own, and it seems the site does not like that. You can verify this by requesting https://httpbin.org/headers and checking the User-Agent value it reports. See the section about the User-Agent header in the docs.
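
For anyone landing here later, a minimal settings sketch of that suggestion, assuming the project is configured through a regular settings.py (the same keys could go in a spider's custom_settings instead). Setting USER_AGENT to None is meant to stop Scrapy's user agent from overriding the one the browser sends by default; the handler and reactor entries are the usual scrapy-playwright setup and are assumed here, since the original post does not show its settings.

```python
# settings.py (sketch; adapt to your own project)

# Route http/https downloads through scrapy-playwright (assumed setup,
# not shown in the original post).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Do not send Scrapy's user agent; let the browser send its own
# default User-Agent header instead.
USER_AGENT = None
```

With this in place, pointing the spider at https://httpbin.org/headers should show the browser's User-Agent rather than Scrapy's, which is a quick way to confirm the change took effect before retrying the real site.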

The problem has been successfully resolved. Thank you very, very much!!!

Thank you, buddy, for your help. I really appreciate it.