TeamHG-Memex/scrapy-rotating-proxies

Do not cache failed requests


If HTTPCACHE_ENABLED is set to True and a request fails due to a ban, it seems that the response still gets cached (by the standard HttpCacheMiddleware), so all subsequent retries fail as well. Would it be possible to cache only successful requests? My guess is that it could be done by setting dont_cache in request.meta, either directly in BanDetectionMiddleware (right after detecting the ban and setting request.meta['_ban'] to True) or in a custom middleware sitting just after BanDetectionMiddleware (and before HttpCacheMiddleware) that does basically the same thing. The first solution would be preferred, presumably controlled by a newly introduced setting, e.g. CACHE_BANNED_REQUESTS.

What do you think? If my assumptions are correct and the proposed solution is accepted, I could prepare a PR.
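For illustration, the second option (a small custom middleware) might look roughly like the sketch below. The class name DontCacheBannedMiddleware is hypothetical, and the stand-in objects replace real Scrapy requests and responses so the snippet runs on its own. It assumes that '_ban' has already been set on request.meta by the time this middleware sees the response, and that HttpCacheMiddleware has not yet stored the response at that point; whether both hold depends on the order in DOWNLOADER_MIDDLEWARES, which is not verified here.

```python
from types import SimpleNamespace


class DontCacheBannedMiddleware:
    """Hypothetical downloader middleware: mark banned requests as non-cacheable."""

    def process_response(self, request, response, spider):
        # '_ban' is the flag this issue says BanDetectionMiddleware sets.
        if request.meta.get('_ban', False):
            # 'dont_cache' is the request.meta key mentioned in the issue;
            # the HTTP cache is expected to skip requests that carry it.
            request.meta['dont_cache'] = True
        return response


# Quick check with stand-in objects instead of real Scrapy classes.
banned = SimpleNamespace(meta={'_ban': True})
clean = SimpleNamespace(meta={})
response = SimpleNamespace(status=200, body=b'')

mw = DontCacheBannedMiddleware()
mw.process_response(banned, response, spider=None)
mw.process_response(clean, response, spider=None)
```

After these calls, only the banned request carries the dont_cache flag; the clean request's meta is left untouched.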

Does this middleware ignore the HTTPCACHE_IGNORE_HTTP_CODES setting?

Try HTTPCACHE_IGNORE_HTTP_CODES = [503, 504, 505, 500, 400, 401, 402, 403, 404]

@Granitosaurus That might help in certain cases where a ban is detected based on the HTTP status. But in general, a ban policy can be more complicated (even the default one reports a ban when the HTTP status is 200 and the response body is empty). So it would be best if caching were aligned with the ban policy.
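To make the gap concrete, here is a minimal sketch comparing the two checks. Both function names are hypothetical, and looks_banned models only the single rule mentioned above (status 200 with an empty body), not the full ban policy of the package.

```python
# Status codes taken from the HTTPCACHE_IGNORE_HTTP_CODES suggestion above.
IGNORED_CODES = {503, 504, 505, 500, 400, 401, 402, 403, 404}


def status_filter_allows_caching(status):
    """Mimic a cache check that looks only at the HTTP status code."""
    return status not in IGNORED_CODES


def looks_banned(status, body):
    """Simplified stand-in: treat a 200 response with an empty body as a ban."""
    return status == 200 and len(body) == 0


# A "soft ban": the server answers 200 but sends no content.
status, body = 200, b''
cacheable_by_status = status_filter_allows_caching(status)
banned = looks_banned(status, body)
```

For this response the status-based filter would allow caching even though the simplified ban check flags it, which is the mismatch described above.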

Any updates on this issue?

There's an easy fix: extend the cache policy.

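from scrapy.extensions.httpcache import DummyPolicy


class BanAwarePolicy(DummyPolicy):

    def should_cache_response(self, response, request):
        # default DummyPolicy check: skip ignored HTTP status codes
        valid_response_code = response.status not in self.ignore_http_codes
        # aware of bans: read the flag from request.meta, since response.meta
        # is only available once the response is tied to a request
        ban = request.meta.get('_ban', False)
        return valid_response_code and not ban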

Then activate it in your settings:

HTTPCACHE_POLICY = 'myproject.some_module.BanAwarePolicy'

This should probably be included in the rotating-proxies package, though.