scrapy-async-cloudflare

Asynchronous Cloudflare scraper middleware for Scrapy.

Primary language: Python. License: MIT.

This is a short example of how to integrate asyncio and cloudscraper in Scrapy.
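The general idea behind combining asyncio with a blocking HTTP client can be sketched as follows. This is a minimal illustration, not the middleware's actual code: `blocking_fetch` is a hypothetical stand-in for a blocking call (in a real middleware this would be something like the `get` method of a session returned by `cloudscraper.create_scraper()`), and it sleeps instead of touching the network so the sketch runs anywhere.

```python
import asyncio
import time


def blocking_fetch(url):
    """Hypothetical stand-in for a blocking call such as a cloudscraper GET."""
    time.sleep(0.1)  # simulate network latency without touching the network
    return f"<html>{url}</html>"


async def fetch_async(url):
    """Run the blocking call in the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_fetch, url)


async def fetch_all(urls):
    # Each blocking call runs in a worker thread, so the calls overlap
    # instead of queueing behind one another on the event loop.
    return await asyncio.gather(*(fetch_async(u) for u in urls))


pages = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```

Handing the blocking call to an executor is what lets a coroutine-based `process_request` wait for the response without stalling other requests.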

Requirements

  • Scrapy >= 2.0 (needed for async support)
  • cloudscraper (needed for bypassing Cloudflare)
pip install cloudscraper

Usage

Enable the middleware in Scrapy's settings

To learn more about the TWISTED_REACTOR setting, see Scrapy's documentation.

DOWNLOADER_MIDDLEWARES = {
    # ...
    'your.scrapy.project.middlewares.CloudflareMiddleware': 543,
    # ...
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Enable the middleware in your requests

def start_requests(self):
    ...
    # start_requests must return an iterable of requests, so yield rather than return.
    yield scrapy.Request(url, meta={'cloudflare': True})

Enable the middleware in pipeline requests (if needed)

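from scrapy.pipelines.images import ImagesPipeline


class CustomImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        requests = super().get_media_requests(item, info)
        # Flag every media request so the middleware handles it.
        for req in requests:
            req.meta['cloudflare'] = True
        return requests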

Extension

Please refer to VeNoMouS/cloudscraper for more on using cloudscraper.

To pass more arguments to cloudscraper, change the call to self._cloudscraper_get in the middleware to:

response = await self._cloudscraper_get(request.url, *your_args, **your_kwargs)
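How `_cloudscraper_get` forwards those extra arguments is not shown here, so the following is only one plausible shape, assuming the blocking call is dispatched with `run_in_executor`. `FakeScraper` and `MiddlewareSketch` are hypothetical names used for illustration; `timeout` and `headers` are shown as examples of keyword arguments that a requests-style `get` accepts.

```python
import asyncio
from functools import partial


class FakeScraper:
    """Hypothetical stand-in for a cloudscraper session; echoes what it was called with."""

    def get(self, url, *args, **kwargs):
        return {"url": url, "args": args, "kwargs": kwargs}


class MiddlewareSketch:
    """Hypothetical middleware skeleton showing only the argument forwarding."""

    def __init__(self, scraper):
        self._scraper = scraper

    async def _cloudscraper_get(self, url, *args, **kwargs):
        # run_in_executor forwards positional arguments only, so keyword
        # arguments are bound up front with functools.partial.
        loop = asyncio.get_running_loop()
        call = partial(self._scraper.get, url, *args, **kwargs)
        return await loop.run_in_executor(None, call)


middleware = MiddlewareSketch(FakeScraper())
result = asyncio.run(
    middleware._cloudscraper_get(
        "https://example.com",
        timeout=10,
        headers={"Referer": "https://example.com"},
    )
)
```

Binding the arguments with `partial` keeps the executor call unchanged no matter how many options are passed through.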