elacuesta/scrapy-pyppeteer

Set User Agent

XioYang23 opened this issue · 5 comments

Hey, I'm looking for the correct way to change/set the user agent.

I've tried:

PYPPETEER_LAUNCH_OPTIONS={
  'headless':False,
  'agrs':[(
    'userAgent','THE USER AGENT VALUE HERE'
    )]
}

and

await page.setUserAgent('userAgent','UA VALUE HERE')
await pyppeteer.page.Page.setUserAgent('userAgent','UA VALUE HERE')

and

await page.setUserAgent('UA VALUE HERE')
await pyppeteer.page.Page.setUserAgent('UA VALUE HERE')

I've tried a mixture of these, all with the same result.

None of these set the user-agent value, and the code still uses the default Chrome 71 agent.

The default UA is:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3542.0 Safari/537.36

Hi! The simplest way is probably to modify the USER_AGENT Scrapy setting. See the following example:

import logging

import scrapy
import pyppeteer

logging.getLogger("pyppeteer").setLevel(logging.INFO)
logging.getLogger("websockets").setLevel(logging.INFO)

class UserAgentSpider(scrapy.Spider):
    name = "user_agent"
    custom_settings = {
        "USER_AGENT": "The Magic Words are Squeamish Ossifrage",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/headers", meta=dict(pyppeteer=True))

    async def parse(self, response, page: pyppeteer.page.Page):
        await page.screenshot(options={"path": "headers.png"})
        await page.close()

scrapy runspider user_agent.py -s TWISTED_REACTOR=twisted.internet.asyncioreactor.AsyncioSelectorReactor
then produces the following screenshot:
[image: headers]

Additionally, you could use the DEFAULT_REQUEST_HEADERS setting, or handle it on a per-request basis by providing a value for the User-Agent key in the headers parameter when creating each Request.

Thanks for using this package!

Hey, apologies for the late response.

Using:

class UserAgentSpider(scrapy.Spider):
    name = "user_agent"
    custom_settings = {
        "USER_AGENT": "THIS IS WHERE THE USERAGENT IS ",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
        },
    }

That seems to work for the response that Scrapy reports, but Pyppeteer still seems to use a different user agent.

Example:

[D:pyppeteer.connection.Connection]
\"User-Agent\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3542.0 Safari/537.36\"}

From the log, it is using the user agent above.

[scrapy.core.engine] DEBUG: Crawled (200) <GET > (referer: None) ['pyppeteer']
{b'User-Agent': [b'THIS IS WHERE THE USERAGENT IS ']}

Then Scrapy kicks in at the response.

Is there any way to change the user agent on the request itself?


Request headers are updated here (notice they are only updated for "intentional" requests, i.e., not secondary requests triggered by the browser, like images, stylesheets, fonts, etc.). I think what you observed can be caused by one of the following:

  1. The message corresponds to one of those secondary requests.
  2. The message corresponds to a request from Scrapy, but it's logged before the headers are updated.

Regarding (1), I will update the above function so all requests include the defined User Agent header. Regarding (2), I just tried with the above spider (#5 (comment)), which makes only one request. By commenting out the logging.getLogger("pyppeteer").setLevel(logging.INFO) line, you can see that the original User Agent from Chrome does appear in some messages; however, the updated one is the one that actually reaches the website.

[image: Screen Shot 2020-12-18 at 8 42 00 PM]

It appears to be the browser's user agent. Here is a photo of Chrome telling me the agent is outdated;
even though I've added the latest one, it doesn't seem to be reading it correctly. Thanks.

Also, amazing middleware, I love playing around with it! Thanks a lot!

Right, the default Chrome version comes bundled with the upstream pyppeteer package, but you can override it by setting PYPPETEER_LAUNCH_OPTIONS = {"executablePath": "/path/to/a/specific/chrome/binary"}
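
A minimal sketch of how that could look in a project's settings.py, assuming the settings from earlier in this thread. The binary path is the same placeholder as above and has to be replaced with a real Chrome/Chromium executable on your machine; the CHROME_BINARY name is only for illustration.

```python
# settings.py (the same keys could also go into a spider's custom_settings)

# Placeholder path: replace with an actual Chrome/Chromium binary.
CHROME_BINARY = "/path/to/a/specific/chrome/binary"

# Options forwarded to pyppeteer when the browser is launched.
PYPPETEER_LAUNCH_OPTIONS = {
    "headless": False,  # taken from the original question; optional
    "executablePath": CHROME_BINARY,
}

# Settings already used in the examples above.
USER_AGENT = "The Magic Words are Squeamish Ossifrage"
DOWNLOAD_HANDLERS = {
    "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```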