Set User Agent

Question

XioYang23 opened this issue 4 years ago · 5 comments

Hey, I'm looking for the correct way to change/set the user agent.

I've tried:

PYPPETEER_LAUNCH_OPTIONS={
  'headless':False,
  'agrs':[(
    'userAgent','THE USER AGENT VALUE HERE'
    )]
}

&

await page.setUserAgent('userAgent','UA VALUE HERE')
await pyppeteer.page.Page.setUserAgent('userAgent','UA VALUE HERE')

&

await page.setUserAgent('UA VALUE HERE')
await pyppeteer.page.Page.setUserAgent('UA VALUE HERE')

I've tried a mixture of these, all with the same result.

None of these set the user-agent value, and the code still uses the default Chrome 71 user agent.

The default UA is:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3542.0 Safari/537.36

Answer 1 · 2020-10-10T21:02:26.000Z

Hi! The simplest way is probably to modify the USER_AGENT Scrapy setting. See the following example:

import logging

import scrapy
import pyppeteer

logging.getLogger("pyppeteer").setLevel(logging.INFO)
logging.getLogger("websockets").setLevel(logging.INFO)

class UserAgentSpider(scrapy.Spider):
    name = "user_agent"
    custom_settings = {
        "USER_AGENT": "The Magic Words are Squeamish Ossifrage",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/headers", meta=dict(pyppeteer=True))

    async def parse(self, response, page: pyppeteer.page.Page):
        await page.screenshot(options={"path": "headers.png"})
        await page.close()

scrapy runspider user_agent.py -s TWISTED_REACTOR=twisted.internet.asyncioreactor.AsyncioSelectorReactor
then produces the following screenshot:

Additionally, you could use the DEFAULT_REQUEST_HEADERS setting, or handle it on a per-request basis by providing a value for the User-Agent key in the headers parameter when creating each Request.
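The two alternatives mentioned above could look roughly like the following. This is a minimal sketch, not a tested configuration: the user agent string is a placeholder, and `pyppeteer_request_kwargs` is a hypothetical helper (not part of Scrapy or scrapy-pyppeteer) that only builds the keyword arguments one would pass to `scrapy.Request`.

```python
# Placeholder value; substitute a real user agent string.
CUSTOM_UA = "The Magic Words are Squeamish Ossifrage"

# Option 1: project-wide default headers (settings.py or custom_settings).
# Note: defining this setting replaces Scrapy's built-in default headers,
# so repeat any Accept / Accept-Language entries you still want.
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": CUSTOM_UA,
}


def pyppeteer_request_kwargs(url, user_agent=CUSTOM_UA):
    """Hypothetical helper: build keyword arguments for scrapy.Request.

    Option 2: set the header per request through the `headers` parameter.
    """
    return {
        "url": url,
        "headers": {"User-Agent": user_agent},
        "meta": {"pyppeteer": True},  # same meta key as in the spider above
    }


# Inside a spider's start_requests this would be used as:
#     yield scrapy.Request(**pyppeteer_request_kwargs("https://httpbin.org/headers"))
```

A per-request header is the narrowest option, which can help when only some requests should carry a custom user agent.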

Thanks for using this package!

Answer 2 · 2020-12-10T23:37:15.000Z

Hey, apologies for the late response.

Using:

class UserAgentSpider(scrapy.Spider):
    name = "user_agent"
    custom_settings = {
        "USER_AGENT": "THIS IS WHERE THE USERAGENT IS ",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
        },
    }

That seems to work for the response on the Scrapy side, but Pyppeteer still seems to use a different user agent.

Example:

[D:pyppeteer.connection.Connection]
\"User-Agent\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3542.0 Safari/537.36\"}

From the log, it's using the user agent shown above.

[scrapy.core.engine] DEBUG: Crawled (200) <GET > (referer: None) ['pyppeteer']
{b'User-Agent': [b'THIS IS WHERE THE USERAGENT IS ']}

Then Scrapy kicks in at the response.

Is there any way to change the user agent of the request itself?

(edited for syntax highlighting)

Answer 3 · 2020-12-11T16:24:09.000Z

Request headers are updated here (notice they are only updated for "intentional" requests, i.e., not secondary requests triggered by the browser, such as images, stylesheets, fonts, etc.). I think what you observed can be caused by one of the following:

1. The message corresponds to one of those secondary requests.
2. The message corresponds to a request from Scrapy, but it's logged before the headers are updated.

Regarding (1), I will update the above function so all requests include the defined User-Agent header. Regarding (2), I just tried with the above spider (#5 (comment)), which makes only one request. By commenting out the logging.getLogger("pyppeteer").setLevel(logging.INFO) line, you can see that the original User Agent from Chrome does appear in some messages; however, the updated one is the one that actually reaches the website.
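One way to check which user agent actually reaches the website, rather than relying on log messages, is to read it back from the httpbin.org/headers response. The sketch below assumes the body contains a single JSON object of the form {"headers": {...}}, possibly wrapped in HTML because a browser may render the JSON inside a <pre> element. `user_agent_seen_by_server` is a hypothetical helper, and the sample body is made up for illustration.

```python
import json


def user_agent_seen_by_server(body: str) -> str:
    """Hypothetical helper: extract the User-Agent echoed by httpbin.org/headers.

    Assumes the body holds exactly one JSON object, optionally surrounded
    by HTML markup that contains no curly braces of its own.
    """
    start = body.index("{")       # first opening brace of the JSON object
    end = body.rindex("}") + 1    # one past the last closing brace
    return json.loads(body[start:end])["headers"]["User-Agent"]


# Made-up body shaped like an httpbin.org/headers response rendered by a browser.
sample = '<html><body><pre>{"headers": {"User-Agent": "my-custom-agent"}}</pre></body></html>'
print(user_agent_seen_by_server(sample))
```

In the spider's parse callback, this could be called as user_agent_seen_by_server(response.text) and compared with the configured USER_AGENT.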

Answer 4 · 2020-12-19T01:51:02.000Z

It appears to be the browser's user agent. Here is a photo of Chrome telling me the agent is outdated;
even though I've added the latest one, it doesn't seem to be reading it correctly. Thanks!

Also, amazing middleware, I love playing around with it! Thanks a lot!

Answer 5 · 2020-12-21T16:20:14.000Z

Right, the default Chrome version comes bundled with the upstream pyppeteer package, but you can override it by setting PYPPETEER_LAUNCH_OPTIONS = {"executablePath": "/path/to/a/specific/chrome/binary"}
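As a rough sketch, the launch options could be assembled as shown below. `build_launch_options` is a hypothetical helper, not part of pyppeteer or scrapy-pyppeteer. It relies on two assumptions: that pyppeteer forwards the entries of `args` to the browser as command-line switches, and that the chosen Chrome build honours Chromium's `--user-agent` switch. This has not been verified against the setup in this thread. Note that the key is spelled `args`, and each entry is a single string rather than a tuple.

```python
def build_launch_options(executable_path=None, user_agent=None, headless=True):
    """Hypothetical helper: assemble a dict for PYPPETEER_LAUNCH_OPTIONS."""
    options = {"headless": headless}
    if executable_path:
        # Use a specific Chrome binary instead of pyppeteer's default one.
        options["executablePath"] = executable_path
    if user_agent:
        # Chromium command-line switch, passed as one string in the `args` list.
        options["args"] = [f"--user-agent={user_agent}"]
    return options


PYPPETEER_LAUNCH_OPTIONS = build_launch_options(
    executable_path="/path/to/a/specific/chrome/binary",
    user_agent="THE USER AGENT VALUE HERE",
    headless=False,
)
```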