Crawling website does not use extra headers specified in configuration
Problem Statement
In my project, we are building a monitoring workflow that generates Lighthouse reports for our website, so we can track performance across releases.
The website in this case uses Cloudflare Bot Fight Mode, which allow-lists a specific user agent for our synthetic tools. For this reason, I am configuring Unlighthouse to use that user agent when making requests to the website.
Expected Behaviour
After setting lighthouseOptions.extraHeaders['User-Agent'] in the configuration, requests to sitemap.xml should use this user agent so Cloudflare doesn't block them.
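For reference, a minimal sketch of that configuration (the site and user agent values are placeholders):

```js
// unlighthouse.config.js — sketch only; values are placeholders
export default {
  site: 'example.com',
  lighthouseOptions: {
    extraHeaders: {
      // The user agent allow-listed in Cloudflare
      'User-Agent': 'your-allow-listed-agent'
    }
  }
}
```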
Current Behaviour
Unlighthouse always sends its own 'Unlighthouse' user agent on the axios requests. This is caused by the line below:
unlighthouse/packages/core/src/util.ts
Line 139 in 8a79f62
When spreading the configuration, the last spread takes precedence when object keys conflict. In this code, my custom User-Agent header is therefore ignored, and the sitemap.xml request fails with a 403.
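To illustrate the precedence behavior (a simplified sketch, not the actual Unlighthouse source):

```js
const userHeaders = { 'User-Agent': 'your-allow-listed-agent' }

// The default spread last wins on key conflicts, so the custom header is dropped:
const headers = { ...userHeaders, 'User-Agent': 'Unlighthouse' }
console.log(headers['User-Agent']) // 'Unlighthouse'

// Spreading the user-supplied headers last would preserve the override:
const fixed = { 'User-Agent': 'Unlighthouse', ...userHeaders }
console.log(fixed['User-Agent']) // 'your-allowed-agent' -> 'your-allow-listed-agent'
```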
I cannot add 'Unlighthouse' as an allowed user agent in Cloudflare, as any client could spoof that user agent, which poses a risk.
Hi, please configure this using the top-level extraHeaders config rather than modifying lighthouseOptions directly.
https://unlighthouse.dev/guide/guides/authentication#custom-headers-authentication
I have set the configuration as you mentioned (I am omitting sensitive information).
My configuration file looks like this:
```js
const config = {
  site: 'xxxx.xx',
  ci: {
    budget: {
      performance: 0.9,
      accessibility: 0.9,
      'best-practices': 0.9,
      seo: 0.9
    }
  },
  scanner: {
    exclude: ['/.*?pdf', '.*/amp', 'en-*', '.*?mp4'],
    samples: 1,
    sitemap: true,
    robotsTxt: false,
    throttle: false
  },
  extraHeaders: {
    'User-Agent': 'XXXX'
  },
  debug: true
}

export default config
```
My GitHub Action (I'm running it in a GitHub workflow) is still failing with the message:
[warn] [Unlighthouse] Request to site xxxx.xx/ threw an unhandled exception. Please check the URL is valid and not blocking crawlers. Request failed with status code 403
I double-checked that the user agent allow-listed in Cloudflare matches the one in the configuration. The site is just not receiving this user agent string.
New information.
This doesn't happen when I run locally, only in the GitHub CI environment.
In the CI environment, though, I notice the requests for sitemap.xml are made from the Node environment using fetch(). These requests are not sending the user agent header. Is it possible that GitHub CI enforces its own user agent header?
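One way to check whether the header survives in that environment is a quick echo test (a hypothetical debugging snippet; httpbin.org simply returns the headers it received):

```js
// Requires Node 18+, where fetch() is available globally.
// Run as an ES module (e.g. check-headers.mjs) to allow top-level await.
const res = await fetch('https://httpbin.org/headers', {
  headers: { 'User-Agent': 'XXXX' },
})
const body = await res.json()
console.log(body.headers['User-Agent']) // prints 'XXXX' if the header was forwarded
```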
Hey, I think I caught the bug. Can you try out v0.13.1?