Asynchronous Cloudflare scraper middleware for Scrapy.
This is a short example of how to integrate asyncio and cloudscraper in Scrapy.
- Scrapy >= 2.0 (needed for async support)
- cloudscraper (needed for bypassing Cloudflare)
pip install cloudscraper
Enable the middleware in Scrapy's settings
If you want to know more about the TWISTED_REACTOR
setting, see the Scrapy documentation.
DOWNLOADER_MIDDLEWARES = {
    # ...
    'your.scrapy.project.middlewares.CloudflareMiddleware': 543,
    # ...
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
Enable the middleware in your requests
def start_requests(self):
    ...
    # start_requests must produce an iterable of requests, so yield each one
    yield scrapy.Request(url, meta={'cloudflare': True})
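The 'cloudflare' meta key acts as a per-request opt-in. As a rough illustration of how a downloader middleware can honor such a flag (this is not the actual CloudflareMiddleware code), the sketch below returns None for unflagged requests, which in Scrapy means "continue with normal processing". FakeRequest, OptInSketch, the marker string, and the example.com URL are hypothetical stand-ins so the example runs without Scrapy installed.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    # Hypothetical stand-in for scrapy.Request: only url and meta are modeled.
    url: str
    meta: dict = field(default_factory=dict)


class OptInSketch:
    async def process_request(self, request, spider):
        # No flag: return None so Scrapy keeps using its default downloader.
        if not request.meta.get("cloudflare"):
            return None
        # Flagged: a real middleware would fetch via cloudscraper here and
        # return a Response; this sketch returns a marker string instead.
        return "fetch-with-cloudscraper"


mw = OptInSketch()
plain = asyncio.run(mw.process_request(FakeRequest("https://example.com"), None))
flagged = asyncio.run(
    mw.process_request(FakeRequest("https://example.com", {"cloudflare": True}), None)
)
```

Because the check is a plain dictionary lookup, requests that never set the key behave exactly as they would without the middleware.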
Enable the middleware in pipeline requests (if needed)
class CustomImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        requests = super().get_media_requests(item, info)
        for req in requests:
            req.meta['cloudflare'] = True
        return requests
Please refer to VeNoMouS/cloudscraper for more on how to use cloudscraper.
To pass more arguments to cloudscraper, change this line to:
response = await self._cloudscraper_get(request.url, *your_args, **your_kwargs)
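One way such forwarding could work, sketched with the standard library only: the blocking call runs in the event loop's default thread pool, and functools.partial carries the extra arguments along. blocking_get and cloudscraper_get are hypothetical stand-ins, and the URL and timeout value are placeholders. In the real middleware the blocking callable would presumably be a cloudscraper session's get method, and the actual _cloudscraper_get may be implemented differently.

```python
import asyncio
import functools


def blocking_get(url, *args, **kwargs):
    # Hypothetical stand-in for a blocking cloudscraper call such as
    # scraper.get(url, **kwargs); it just echoes what it received.
    return {"url": url, "args": args, "kwargs": kwargs}


async def cloudscraper_get(url, *args, **kwargs):
    # run_in_executor() only forwards positional arguments, so the keyword
    # arguments are bound with functools.partial first.
    loop = asyncio.get_running_loop()
    call = functools.partial(blocking_get, url, *args, **kwargs)
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, call)


result = asyncio.run(cloudscraper_get("https://example.com", timeout=10))
```

Running the blocking call in a worker thread keeps the asyncio reactor free to process other requests while the Cloudflare challenge is being solved.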