Set User Agent
XioYang23 opened this issue · 5 comments
Hey, I'm looking for the correct way to change/set the user agent.
I've tried:
PYPPETEER_LAUNCH_OPTIONS = {
    'headless': False,
    'agrs': [(
        'userAgent', 'THE USER AGENT VALUE HERE'
    )]
}
and
await page.setUserAgent('userAgent', 'UA VALUE HERE')
await pyppeteer.page.Page.setUserAgent('userAgent', 'UA VALUE HERE')
and
await page.setUserAgent('UA VALUE HERE')
await pyppeteer.page.Page.setUserAgent('UA VALUE HERE')
I've tried a mixture of these, all with the same result.
None of them sets the user-agent value, and the code still uses the default Chrome 71 agent.
The default UA is:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3542.0 Safari/537.36
Hi! The simplest way is probably to modify the USER_AGENT Scrapy setting. See the following example:
import logging

import scrapy
import pyppeteer

logging.getLogger("pyppeteer").setLevel(logging.INFO)
logging.getLogger("websockets").setLevel(logging.INFO)


class UserAgentSpider(scrapy.Spider):
    name = "user_agent"
    custom_settings = {
        "USER_AGENT": "The Magic Words are Squeamish Ossifrage",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/headers", meta=dict(pyppeteer=True))

    async def parse(self, response, page: pyppeteer.page.Page):
        await page.screenshot(options={"path": "headers.png"})
        await page.close()
scrapy runspider user_agent.py -s TWISTED_REACTOR=twisted.internet.asyncioreactor.AsyncioSelectorReactor
then produces the following screenshot:
Additionally, you could use the DEFAULT_REQUEST_HEADERS setting, or handle it on a per-request basis by providing a value for the User-Agent key in the headers parameter when creating each Request.
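The two alternatives just mentioned could look roughly like the sketch below. `request_kwargs` is a hypothetical helper introduced only for illustration, and the header values are placeholders; the `pyppeteer` meta key is the one used in the spider above.

```python
# settings.py (or a spider's custom_settings): used for every request that
# does not already carry its own User-Agent header.
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "The Magic Words are Squeamish Ossifrage",
}


def request_kwargs(user_agent):
    """Hypothetical helper: keyword arguments for a single scrapy.Request."""
    return {
        "headers": {"User-Agent": user_agent},
        "meta": {"pyppeteer": True},
    }


# Inside a spider's start_requests() this might be used as:
#     yield scrapy.Request("https://httpbin.org/headers", **request_kwargs("My UA"))
```

A header passed on the Request itself should take precedence over the setting, since Scrapy only fills in default headers that are missing from the request.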
Thanks for using this package!
Hey, apologies for the late response.
Using:
class UserAgentSpider(scrapy.Spider):
    name = "user_agent"
    custom_settings = {
        "USER_AGENT": "THIS IS WHERE THE USERAGENT IS ",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
        },
    }
That seems to work for the response that Scrapy reports,
but Pyppeteer still seems to use a different user agent.
Example:
[D:pyppeteer.connection.Connection]
\"User-Agent\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3542.0 Safari/537.36\"}
From the log, it's using this user agent ^
[scrapy.core.engine] DEBUG: Crawled (200) <GET > (referer: None) ['pyppeteer']
{b'User-Agent': [b'THIS IS WHERE THE USERAGENT IS ']}
Then Scrapy kicks in at the response.
Is there any way to change the user agent sent with the request?
(edited for syntax highlighting)
Request headers are updated here (notice that they are only updated for "intentional" requests, i.e., not for secondary requests triggered by the browser, such as images, stylesheets, fonts, etc.). I think what you observed can be caused by one of the following:
1. The message corresponds to one of those secondary requests.
2. The message corresponds to a request from Scrapy, but it is logged before the headers are updated.
Regarding (1), I will update the above function so that all requests include the defined User-Agent header. Regarding (2), I just tried with the above spider (#5 (comment)), which makes only one request. If you comment out the logging.getLogger("pyppeteer").setLevel(logging.INFO) line, you can see that the original User-Agent from Chrome does appear in some messages; however, the updated one is the one that actually reaches the website.
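One way to check which value actually reaches the website is to read it back from the httpbin.org/headers response used in the spider above, rather than from the debug log. The following is a minimal sketch; `reported_user_agent` is a hypothetical helper, and it assumes the body is the JSON document returned by httpbin, possibly wrapped in the HTML that the browser adds around it.

```python
import html
import json


def reported_user_agent(body):
    """Return the User-Agent that httpbin.org/headers echoed back.

    `body` may be raw JSON or a browser-rendered page that wraps the
    JSON in HTML, so only the outermost {...} span is parsed.
    """
    start = body.find("{")
    end = body.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response body")
    payload = json.loads(html.unescape(body[start:end + 1]))
    return payload["headers"]["User-Agent"]
```

In the spider's parse() method this might be called as reported_user_agent(response.text) and compared against the configured USER_AGENT value.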
Right, the default Chrome version comes bundled with the upstream pyppeteer package, but you can override it by setting PYPPETEER_LAUNCH_OPTIONS = {"executablePath": "/path/to/a/specific/chrome/binary"}.
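A settings fragment along these lines could point the handler at a locally installed browser instead of hard-coding the path. This is only a sketch: the executable names in `CANDIDATE_BINARIES` and the `find_browser` helper are assumptions made for illustration and are not part of this package.

```python
import shutil

# Assumed executable names; adjust for the target system.
CANDIDATE_BINARIES = ("google-chrome", "chromium", "chromium-browser")


def find_browser(candidates=CANDIDATE_BINARIES):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


PYPPETEER_LAUNCH_OPTIONS = {}
_browser_path = find_browser()
if _browser_path:
    # Only override the bundled browser when a local binary was found.
    PYPPETEER_LAUNCH_OPTIONS["executablePath"] = _browser_path
```

If no candidate is found, the options stay empty and the browser that ships with pyppeteer is used as before.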