apify/crawlee

Error: Invalid "proxyUrl" option: only HTTP proxies are currently supported

mehrdad-shokri opened this issue · 1 comment

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

The docs never mention that only HTTP proxies are supported. I think using HTTP proxies is a security risk. Digging deeper, you end up here, which crawlee uses. I think it should support HTTPS proxies as well.

Code sample

import { Configuration, Dataset, PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://Username:Password@proxyUrl:PORT',
    ],
});

const crawler = new PlaywrightCrawler(
    {
        proxyConfiguration,
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, page, enqueueLinks, log, crawler }) {
            const title = await page.title();
            const content = await page.content();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);
            // Save results as JSON to ./storage/datasets/default
            await Dataset.pushData({ title, url: request.loadedUrl, content });

            // Extract links from the current page
            // and add them to the crawling queue.
            await enqueueLinks();
        },
        maxRequestsPerCrawl: 1,
        maxConcurrency: 20,
        retryOnBlocked: true,
        maxRequestRetries: 10,
    },
    new Configuration({
        persistStorage: false,
        maxUsedCpuRatio: 0.95,
        availableMemoryRatio: 0.5,
    }),
);

// `url` (the start URL) is not defined anywhere in this snippet.
await crawler.run([url]);
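
Since the error is only raised once the proxy is actually used, a pre-flight check on the configured URLs may surface the problem earlier. The following is a minimal sketch, not proxy-chain's or crawlee's actual validation: `proxyScheme` and `assertHttpProxy` are hypothetical helpers, and the rule "only `http` is accepted" is an assumption taken from the wording of the error message.

```typescript
// Hypothetical helper: extracts the lowercase scheme of a proxy URL,
// or returns null when the string has no "scheme://" prefix.
function proxyScheme(proxyUrl: string): string | null {
    const match = /^([a-z][a-z0-9+.-]*):\/\//i.exec(proxyUrl);
    return match?.[1]?.toLowerCase() ?? null;
}

// Hypothetical helper: throws unless the proxy URL uses plain "http".
// Assumption: this mirrors the rule stated in the error message only.
// The URL itself is left out of the message so credentials are not logged.
function assertHttpProxy(proxyUrl: string): void {
    const scheme = proxyScheme(proxyUrl);
    if (scheme !== 'http') {
        throw new Error(`Unsupported proxy scheme "${scheme ?? 'none'}"; expected "http"`);
    }
}
```

Calling `assertHttpProxy` on each entry of `proxyUrls` before constructing `ProxyConfiguration` would fail fast for an `https://` or `socks5://` URL.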

Package version

crawlee@3.11.0 proxy-chain@2.5.1

Node.js version

v20.10.0 typescript@5.5.2

Operating system

macOS

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

Hello - and thank you for your interest in this project.

Can you please provide a reproduction scenario for the issue you are having?

"I think using HTTP proxies is a security risk"

Note that this is not true - if you are connecting to the target server via HTTPS, the traffic is still end-to-end encrypted. With HTTP proxies, this is achieved via the HTTP CONNECT method, which asks the proxy server to open an opaque data tunnel between the client and the target server; the encrypted data is transferred through this tunnel. The intermediate proxy server cannot read this data (as it's encrypted); it only learns the target host and port.
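
To make the CONNECT step concrete, here is a minimal sketch of the only plaintext exchange the proxy takes part in. The function names `buildConnectRequest` and `tunnelEstablished` are made up for illustration and are not part of crawlee or proxy-chain; proxy authentication headers are omitted for brevity.

```typescript
// Builds the plaintext request a client sends to an HTTP proxy to open a tunnel.
// Only the target host and port appear here; no path, headers, or body
// of the later HTTPS request are included.
function buildConnectRequest(targetHost: string, targetPort: number): string {
    const authority = `${targetHost}:${targetPort}`;
    return [
        `CONNECT ${authority} HTTP/1.1`,
        `Host: ${authority}`,
        '', // blank line that terminates the header block
        '',
    ].join('\r\n');
}

// Returns true when the proxy's status line reports a 2xx status,
// which means the tunnel is open. From that point on, the client runs
// the TLS handshake with the target through the same socket, and the
// proxy only relays bytes it cannot decrypt.
function tunnelEstablished(proxyResponseHead: string): boolean {
    return /^HTTP\/1\.[01] 2\d\d(?: |\r?\n|$)/.test(proxyResponseHead);
}
```

For example, `buildConnectRequest('example.com', 443)` produces `CONNECT example.com:443 HTTP/1.1` followed by a `Host` header and an empty line, and a reply starting with `HTTP/1.1 200` makes `tunnelEstablished` return `true`.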

If you are connecting to an HTTP target server (or you decide to fiddle around with the TLS settings - see e.g. comments under this issue), the proxy can indeed act as a MITM and read your traffic - but you really have to want this - it will never happen with the default