Rate limiting isn't actually implemented correctly
glennpow opened this issue · 3 comments
I was getting rate-limit errors, even though I'd specified a rate_limit_per_minute
arg of 1, so I looked into the RateLimitCache
class. I see that the new()
method is actually never getting called, so this class cannot be performing as expected. I added the following two lines to the bottom of the _impose_rate_limit
method, and it seems to (temporarily) fix the problem.
if hasattr(self, '_rlcache'):
self._rlcache.new()
However, this is not even the proper solution. The rate-limit will still kick in, since the API is actually limited to 1 request/sec (as opposed to 60 requests per 60 seconds). This implementation will send a few requests as fast as possible, then error out. The correct implementation will limit the number of requests per second, rather than per minute.
Interesting... I'm not opposed to changing the rate limit system, but I think this needs consultation/coordination with Pushshift (i.e. Jason).
The rate limit isn't actually a fixed value. Sometimes Jason throttles requests because the server is being DDOSed. Sometimes he grants specific users higher priority access. To solve for this, PSAW hits the /meta endpoint on initialization to ask pushshift directly how it should handle throttling. Here's the response I'm seeing right now:
{
"client_accepts_json": true,
"client_request_headers": {
"ACCEPT": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"ACCEPT-ENCODING": "gzip",
"ACCEPT-LANGUAGE": "en-US,en;q=0.9",
"CDN-LOOP": "cloudflare",
"CF-CONNECTING-IP": "67.183.220.19",
"CF-IPCOUNTRY": "US",
"CF-RAY": "5dc08b0acded02b8-SEA",
"CF-REQUEST-ID": "058c353abc000002b8b5888200000001",
"CF-VISITOR": "{\"scheme\":\"https\"}",
"CONNECTION": "close",
"HOST": "api.pushshift.io",
"SEC-FETCH-DEST": "document",
"SEC-FETCH-MODE": "navigate",
"SEC-FETCH-SITE": "none",
"UPGRADE-INSECURE-REQUESTS": "1",
"USER-AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
"X-FORWARDED-FOR": "67.183.220.19, 67.183.220.19",
"X-FORWARDED-PROTO": "https"
},
"client_user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
"server_ratelimit_per_minute": 120,
"source-ip": "67.183.220.19"
}
In particular, take note of the attribute server_ratelimit_per_minute.
PSAW sets it's rate limiting on a per-minute basis because according to the meta endpoint, so does pushshift. It sounds like there may be an additional per-second or per-k-seconds rate limit we should respect. If this is the case, I think the best way forward would be to add the per-second limit to the /meta endpoint so PSAW doesn't need to make assumption about what it's supposed to be.
That's interesting that you're observing a 60 requests/minute limit given that the server is still reporting 120. I wonder if maybe Jason changed the throttling without modifying the value reported by the /meta endpoint. If so, I'd recommend that he tie those things together so he doesn't need to manually change both separately.
@pushshift: thoughts?
@dmarx Thanks for the response. Actually, I wasn't even able to use 60 req/min without getting a rate limit error. I'm currently using 30. I do agree that it would be ideal to use the meta endpoint to set this, but clearly the value of 120/min already doesn't match with what was posted publicly as 1/sec in this thread: https://www.reddit.com/r/pushshift/comments/g7125k/in_an_effort_to_relieve_some_of_the_strain_on_the/
As long as the meta endpoint is accurate, this lib could limit per minute, but I'm not sure how reliable that metadata is.
As far as removing the n
parameter to RateLimitCache
(and only having interval
), you can currently achieve the same effect of something like (n=2, interval=1)
by specifying just (interval=0.5)
, which is what would happen if you specified rate_limit_per_minute=120
to PushshiftAPI
(which is also what it would use now by default if pulled from /meta).