Create GitHub token pool
elad-pticha opened this issue · 8 comments
It can be managed in multiple ways:
- The `--token` flag could be specified multiple times (sketched right after this list), for example:
  `raven download org --token $GITHUB_TOKEN_1 --token $GITHUB_TOKEN_2`
- Add a subcommand such as `raven token add`, and tokens will be saved in the Redis DB.
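For the first option, this is roughly what parsing a repeatable flag looks like (a minimal sketch using `argparse`; raven's actual CLI wiring may differ):

```python
import argparse

# Sketch only: raven's real CLI may be built differently.
parser = argparse.ArgumentParser(prog="raven")
parser.add_argument(
    "--token",
    action="append",  # each occurrence appends to the same list
    dest="tokens",
    default=[],
    help="GitHub token; may be given multiple times to build a pool",
)

args = parser.parse_args(["--token", "ghp_token_1", "--token", "ghp_token_2"])
print(args.tokens)  # ['ghp_token_1', 'ghp_token_2']
```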
How?
- We can keep a list of tokens and pick one at random each time.
- Or track the rate-limit state of each token and each time choose the token with the most remaining rate-limit budget (see the sketch after this list).
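A rough sketch of both selection strategies (the `TokenPool` class is hypothetical, not existing raven code; quotas are fed back from GitHub's `X-RateLimit-Remaining` response header):

```python
import random


class TokenPool:
    """Hypothetical token pool; not part of raven today."""

    def __init__(self, tokens):
        # Quotas are unknown until the first response, so start optimistic.
        self.remaining = {token: float("inf") for token in tokens}

    def get_random(self):
        # Option 1: pick any token uniformly at random.
        return random.choice(list(self.remaining))

    def get_best(self):
        # Option 2: pick the token with the most remaining quota.
        return max(self.remaining, key=self.remaining.get)

    def update(self, token, response):
        # Record GitHub's X-RateLimit-Remaining header after each call.
        self.remaining[token] = int(response.headers["X-RateLimit-Remaining"])
```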
@oreenlivnicode @alex-ilgayev WDYT?
It is super nice for crawling mode.
I think the Redis idea might make it complicated to support expired tokens, or tokens for private-organization mode (a user might want to use a different token there).
What about an option to pass a text file containing a list of tokens, e.g. a `--tokens-file` flag?
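Loading it could be trivial (a sketch, assuming a hypothetical `--tokens-file` flag and one token per line):

```python
def load_tokens(path: str) -> list[str]:
    # One token per line; blank lines and surrounding whitespace ignored.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```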
Have you considered switching the workflow ingest logic to GraphQL?
I've recently been experimenting with it for a tool I maintain that suffers from a similar problem: enumerating a large org is painfully slow (it also downloads workflow YAMLs via the REST API). I still need to integrate it into the main branch of the tool, but the source file below covers how I'm forming the query. Essentially, I use the API to get all of the repos, and then programmatically create GraphQL queries, 100 repos at a time, to download the YAML files within the `.github/workflows` directory. It would not be too complicated to use similar logic for raven.
https://github.com/praetorian-inc/gato/blob/update/checker_and_faster/gato/github/gql_queries.py
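For the curious, the batching trick is GraphQL aliases: each repo becomes an aliased sub-query, so one HTTP round trip fetches the workflow blobs for up to ~100 repositories. A simplified sketch of the idea (illustrative only, not the actual gato code):

```python
import requests

GITHUB_GRAPHQL = "https://api.github.com/graphql"

# Each repo becomes an aliased sub-query; one POST covers the whole batch.
REPO_FRAGMENT = """
repo{i}: repository(owner: "{owner}", name: "{name}") {{
  object(expression: "HEAD:.github/workflows") {{
    ... on Tree {{
      entries {{
        name
        object {{ ... on Blob {{ text }} }}
      }}
    }}
  }}
}}
"""


def build_query(repos):
    # repos: list of (owner, name) tuples, at most ~100 per query.
    parts = [
        REPO_FRAGMENT.format(i=i, owner=owner, name=name)
        for i, (owner, name) in enumerate(repos)
    ]
    return "query {\n" + "\n".join(parts) + "\n}"


def fetch_workflows(repos, token):
    r = requests.post(
        GITHUB_GRAPHQL,
        json={"query": build_query(repos)},
        headers={"Authorization": f"Bearer {token}"},
    )
    r.raise_for_status()
    return r.json()["data"]
```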
Hey @AdnaneKhan, using the GraphQL API is an interesting idea; we haven't experimented with it yet.
About the original ticket, @elad-pticha, did you verify that separate tokens have separate rate limits? Maybe the limit is tied to the user creating these tokens, and having several of them won't help.
@alex-ilgayev, you are right; the rate limit is associated with the user and not the token. GitHub's best practices for rate limits also suggest monitoring the `retry-after` header (which we don't).

@AdnaneKhan It also looks like the GraphQL rate limit is the same, if I am not wrong (source). Is there something I am missing?
@elad-pticha If that is the case, maybe we can close this issue.
We do have some rate-limiting checks in `get_repository_workflows` (which is the main API call that can be limited):

```python
if r.status_code == 403 and int(r.headers["X-RateLimit-Remaining"]) == 0:
    import time

    time_to_sleep = int(r.headers["X-RateLimit-Reset"]) - time.time() + 1
    log.error(
        f"[*] Ratelimit for contents API depleted. Sleeping {time_to_sleep} seconds"
    )
    time.sleep(time_to_sleep)
    return get_repository_workflows(repo)
```
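Covering the `retry-after` best practice mentioned above could look roughly like this (a sketch only, reusing the same `r`, `log`, `repo`, and `get_repository_workflows` names as the snippet; untested against raven):

```python
import time

# GitHub sends retry-after on secondary rate limits, which the
# X-RateLimit-Remaining check above does not catch.
retry_after = r.headers.get("retry-after")
if r.status_code == 403 and retry_after is not None:
    log.error(f"[*] Secondary rate limit hit. Sleeping {retry_after} seconds")
    time.sleep(int(retry_after) + 1)
    return get_repository_workflows(repo)
```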
> @AdnaneKhan It also looks like the GraphQL rate limit is the same, if I am not wrong (source). Is there something I am missing?
I thought this was the case as well, but in my testing it consumed far fewer resources than making equivalent REST calls to the repository contents API. For example, I was able to pull down all of Microsoft's public workflows with 15 GraphQL calls (100 repos at a time) in about 10 minutes with a single account.
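One way to check the point cost directly is to ask the API itself; the GraphQL schema exposes a `rateLimit` object that reports what a query cost (a runnable sketch; `<token>` is a placeholder):

```python
import requests

# The GraphQL API meters points rather than raw request counts; the
# rateLimit object reports the cost of the query that carries it.
COST_QUERY = """
query {
  rateLimit {
    cost
    remaining
    resetAt
  }
}
"""

r = requests.post(
    "https://api.github.com/graphql",
    json={"query": COST_QUERY},
    headers={"Authorization": "Bearer <token>"},  # placeholder token
)
print(r.json()["data"]["rateLimit"])
```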
Interesting, we will check this out!
I am closing this issue for now.
If we decide to switch to GraphQL requests, we should discuss this in another issue.