dmarx/psaw

Rate limiting isn't actually implemented correctly

glennpow opened this issue · 3 comments

I was getting rate-limit errors, even though I'd specified a rate_limit_per_minute arg of 1, so I looked into the RateLimitCache class. I see that the new() method is actually never getting called, so this class cannot be performing as expected. I added the following two lines to the bottom of the _impose_rate_limit method, and it seems to (temporarily) fix the problem.

        if  hasattr(self, '_rlcache'):
            self._rlcache.new()

However, this is not even the proper solution. The rate-limit will still kick in, since the API is actually limited to 1 request/sec (as opposed to 60 requests per 60 seconds). This implementation will send a few requests as fast as possible, then error out. The correct implementation will limit the number of requests per second, rather than per minute.

I do have a PR to fix this now: #81

dmarx commented

Interesting... I'm not opposed to changing the rate limit system, but I think this needs consultation/coordination with Pushshift (i.e. Jason).

The rate limit isn't actually a fixed value. Sometimes Jason throttles requests because the server is being DDOSed. Sometimes he grants specific users higher priority access. To solve for this, PSAW hits the /meta endpoint on initialization to ask pushshift directly how it should handle throttling. Here's the response I'm seeing right now:

{
    "client_accepts_json": true,
    "client_request_headers": {
        "ACCEPT": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "ACCEPT-ENCODING": "gzip",
        "ACCEPT-LANGUAGE": "en-US,en;q=0.9",
        "CDN-LOOP": "cloudflare",
        "CF-CONNECTING-IP": "67.183.220.19",
        "CF-IPCOUNTRY": "US",
        "CF-RAY": "5dc08b0acded02b8-SEA",
        "CF-REQUEST-ID": "058c353abc000002b8b5888200000001",
        "CF-VISITOR": "{\"scheme\":\"https\"}",
        "CONNECTION": "close",
        "HOST": "api.pushshift.io",
        "SEC-FETCH-DEST": "document",
        "SEC-FETCH-MODE": "navigate",
        "SEC-FETCH-SITE": "none",
        "UPGRADE-INSECURE-REQUESTS": "1",
        "USER-AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
        "X-FORWARDED-FOR": "67.183.220.19, 67.183.220.19",
        "X-FORWARDED-PROTO": "https"
    },
    "client_user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "server_ratelimit_per_minute": 120,
    "source-ip": "67.183.220.19"
}

In particular, take note of the attribute server_ratelimit_per_minute.

PSAW sets it's rate limiting on a per-minute basis because according to the meta endpoint, so does pushshift. It sounds like there may be an additional per-second or per-k-seconds rate limit we should respect. If this is the case, I think the best way forward would be to add the per-second limit to the /meta endpoint so PSAW doesn't need to make assumption about what it's supposed to be.

That's interesting that you're observing a 60 requests/minute limit given that the server is still reporting 120. I wonder if maybe Jason changed the throttling without modifying the value reported by the /meta endpoint. If so, I'd recommend that he tie those things together so he doesn't need to manually change both separately.

@pushshift: thoughts?

@dmarx Thanks for the response. Actually, I wasn't even able to use 60 req/min without getting a rate limit error. I'm currently using 30. I do agree that it would be ideal to use the meta endpoint to set this, but clearly the value of 120/min already doesn't match with what was posted publicly as 1/sec in this thread: https://www.reddit.com/r/pushshift/comments/g7125k/in_an_effort_to_relieve_some_of_the_strain_on_the/
As long as the meta endpoint is accurate, this lib could limit per minute, but I'm not sure how reliable that metadata is.
As far as removing the n parameter to RateLimitCache (and only having interval), you can currently achieve the same effect of something like (n=2, interval=1) by specifying just (interval=0.5), which is what would happen if you specified rate_limit_per_minute=120 to PushshiftAPI (which is also what it would use now by default if pulled from /meta).