Error: Invalid "proxyUrl" option: only HTTP proxies are currently supported
mehrdad-shokri opened this issue · 1 comments
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)
Issue description
The docs never mention that only HTTP proxies are supported. I think using HTTP proxies is a security risk. Digging deeper, you end up in here, which crawlee uses. I think it should support HTTPS proxies as well.
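To make the limitation concrete, here is a minimal sketch of a pre-flight check, assuming (based only on the error message in the title) that the rejection depends on the URL scheme. The helper `findNonHttpProxyUrls` and the `proxy.example.com` URLs are hypothetical and are not part of crawlee or proxy-chain.

```typescript
// Hypothetical helper (not a crawlee or proxy-chain API): returns the proxy
// URLs whose scheme is anything other than plain "http:".
function findNonHttpProxyUrls(proxyUrls: string[]): string[] {
    return proxyUrls.filter((proxyUrl) => new URL(proxyUrl).protocol !== 'http:');
}

// Placeholder proxy URLs, used for illustration only.
const candidates = [
    'http://user:pass@proxy.example.com:8000',
    'https://user:pass@proxy.example.com:8443',
];

// Collect the URLs that would presumably trigger the error from the title.
const rejected = findNonHttpProxyUrls(candidates);
console.log(rejected);
```

Running a check like this before constructing `ProxyConfiguration` would surface the unsupported URL earlier than the crawler does.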
Code sample
import { Configuration, Dataset, PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://Username:Password@proxyUrl:PORT',
    ],
});
const crawler = new PlaywrightCrawler(
    {
        proxyConfiguration,
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, page, enqueueLinks, log, crawler }) {
            const title = await page.title();
            const content = await page.content();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);
            // Save results as JSON to ./storage/datasets/default
            await Dataset.pushData({ title, url: request.loadedUrl, content });
            // Extract links from the current page
            // and add them to the crawling queue.
            await enqueueLinks();
        },
        maxRequestsPerCrawl: 1,
        maxConcurrency: 20,
        retryOnBlocked: true,
        maxRequestRetries: 10,
    },
    new Configuration({
        persistStorage: false,
        maxUsedCpuRatio: 0.95,
        availableMemoryRatio: 0.5,
    }),
);
// `url` is the start URL; it is not defined in this sample.
await crawler.run([url]);
Package version
crawlee@3.11.0 proxy-chain@2.5.1
Node.js version
v20.10.0 typescript@5.5.2
Operating system
macOS
Apify platform
- Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
Hello - and thank you for your interest in this project.
Can you please provide a reproduction scenario for the issue you are having?
"I think using http proxies are a security risk"
Note that this is not true - if you are connecting to the target server via HTTPS, the traffic is still end-to-end encrypted. With HTTP proxies, this is achieved via the HTTP CONNECT method, which creates an opaque data tunnel from the client to the proxy server, through which the encrypted data is transferred. The intermediate proxy server cannot read this data (as it's encrypted).
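As a rough illustration of the tunnel setup described above, the sketch below builds the plain-text CONNECT request a client sends to an HTTP proxy and checks the proxy's status line. The function names `buildConnectRequest` and `isTunnelEstablished` are hypothetical, and this is not how crawlee or proxy-chain implement it; it only shows what the proxy gets to see, namely the target host and port.

```typescript
// Sketch only: the text a client sends to an HTTP proxy to request a tunnel.
// The proxy learns the target authority (host:port) and nothing more.
function buildConnectRequest(targetHost: string, targetPort: number): string {
    const authority = `${targetHost}:${targetPort}`;
    return [
        `CONNECT ${authority} HTTP/1.1`,
        `Host: ${authority}`,
        '', // blank line that terminates the header section
        '',
    ].join('\r\n');
}

// Any 2xx status in the proxy's reply means the tunnel is open.
function isTunnelEstablished(statusLine: string): boolean {
    const match = /^HTTP\/1\.[01] (\d{3})/.exec(statusLine);
    return match !== null && match[1].startsWith('2');
}

const connectRequest = buildConnectRequest('example.com', 443);
console.log(JSON.stringify(connectRequest));
```

Once the proxy answers with a 2xx status, the client performs the TLS handshake with the target through the tunnel, so everything after that point reaches the proxy as ciphertext.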
If you are connecting to an HTTP target server (or you decide to fiddle around with the TLS settings - see e.g. comments under this issue), the proxy can indeed act as MITM and read your traffic - but you really have to want this - it will never happen with the default