CycodeLabs/raven

Create GitHub token pool

elad-pticha opened this issue · 8 comments

It can be managed in multiple ways:

  1. The --token flag could be specified multiple times (see the sketch after this list), for example:
    raven download org --token $GITHUB_TOKEN_1 --token $GITHUB_TOKEN_2
  2. Add a subcommand such as raven token add, and tokens would be saved in the Redis DB.
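
For illustration, option 1 could be wired up with argparse's action="append", which collects every --token occurrence into a list. A minimal sketch; the flag and program names are illustrative, not necessarily raven's actual CLI:

    import argparse

    # action="append" gathers each repeated --token into a single list.
    parser = argparse.ArgumentParser(prog="raven")
    parser.add_argument(
        "--token",
        action="append",
        dest="tokens",
        default=[],
        help="GitHub token; may be given multiple times",
    )

    args = parser.parse_args(["--token", "ghp_aaa", "--token", "ghp_bbb"])
    print(args.tokens)  # ['ghp_aaa', 'ghp_bbb']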

How?

  1. We can create a list of tokens and pick one at random each time.
  2. Track the remaining rate-limit quota for each token and each time choose the token with the most quota left (see the sketch below).
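
A minimal sketch of the second strategy, assuming the quota is tracked from the X-RateLimit-Remaining response header; the class and method names are made up for illustration:

    class TokenPool:
        def __init__(self, tokens):
            # Until we see a response, assume each token still has the
            # full 5,000 requests/hour quota.
            self.remaining = {token: 5000 for token in tokens}

        def get(self):
            # Hand out the token with the most quota left.
            return max(self.remaining, key=self.remaining.get)

        def update(self, token, response):
            # Refresh the quota from the last response's headers.
            self.remaining[token] = int(response.headers["X-RateLimit-Remaining"])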

This would be super nice for crawling mode.
I think the Redis idea might be complicated when it comes to supporting expired tokens or tokens for private-organization mode (a user might want to use a different token there).

What about an option to pass a text file containing a list of tokens, e.g. --tokens-file?
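
Parsing such a file would be straightforward; a sketch, assuming one token per line with blank lines and # comments ignored:

    # Hypothetical --tokens-file handling: one token per line.
    with open("tokens.txt") as f:
        tokens = [
            line.strip()
            for line in f
            if line.strip() and not line.startswith("#")
        ]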

Have you considered switching the workflow ingest logic to GraphQL?

I have recently been experimenting with it for a tool I maintain that suffers from a similar problem: enumerating large orgs is painfully slow (it also downloads workflow YAMLs via the REST API). I still need to integrate it into the tool's main branch, but the source file below covers how I'm forming the query. Essentially, I use the REST API to get all of the repos and then programmatically create GraphQL queries, 100 repos at a time, to download the YAML files within the .github/workflows directory. It would not be too complicated to use similar logic for raven.

https://github.com/praetorian-inc/gato/blob/update/checker_and_faster/gato/github/gql_queries.py
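
For readers who don't want to follow the link, the batching pattern looks roughly like this; the query shape is an approximation of what that file builds, not a drop-in copy:

    # Build one GraphQL query that aliases many repositories and pulls the
    # .github/workflows tree for each of them in a single API call.
    def build_workflow_query(repos):
        parts = []
        for i, (owner, name) in enumerate(repos):
            parts.append(
                f'repo{i}: repository(owner: "{owner}", name: "{name}") {{ '
                'object(expression: "HEAD:.github/workflows") { '
                '... on Tree { entries { name object { ... on Blob { text } } } } '
                '} }'
            )
        return "query { " + " ".join(parts) + " }"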

Hey @AdnaneKhan, using the GraphQL API is an interesting idea; we haven't experimented with it yet.

About the original ticket, @elad-pticha, did you verify that separate tokens have separate rate limits?
Maybe the limit is tied to the user who created the tokens, in which case having several of them won't help.

@alex-ilgayev, you are right; the rate limit is associated with the user and not the token.
GitHub's rate-limit best practices also suggest monitoring the retry-after header (which we currently don't).

@AdnaneKhan It also looks like the GraphQL rate limit is the same, if I am not wrong (source). Is there something I am missing?
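
For what it's worth, the GraphQL API can report its own accounting: the rateLimit object returns the point cost of a query and the remaining hourly quota, so the comparison could be measured directly. A minimal sketch, as a Python query string:

    # Appending GitHub's rateLimit object to any GraphQL query reports the
    # query's point cost and the remaining quota.
    RATE_LIMIT_QUERY = """
    query {
      rateLimit { cost remaining resetAt }
    }
    """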

@elad-pticha If that is the case, then maybe we can close this issue.
We do have some rate-limiting checks in get_repository_workflows (which is the main API call that can hit the limit):

    import time

    if r.status_code == 403 and int(r.headers["X-RateLimit-Remaining"]) == 0:
        # X-RateLimit-Reset is a Unix timestamp; sleep until the quota
        # resets, then retry the call.
        time_to_sleep = int(r.headers["X-RateLimit-Reset"]) - int(time.time()) + 1
        log.error(
            f"[*] Ratelimit for contents API depleted. Sleeping {time_to_sleep} seconds"
        )
        time.sleep(time_to_sleep)
        return get_repository_workflows(repo)
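
As @elad-pticha noted, GitHub's best practices also suggest honoring the Retry-After header, which secondary rate limits send instead of a depleted X-RateLimit-Remaining. A sketch of how the check above could be extended (untested):

    # Secondary rate limits return 403 with a Retry-After header rather
    # than a zeroed X-RateLimit-Remaining.
    retry_after = r.headers.get("Retry-After")
    if r.status_code == 403 and retry_after is not None:
        time.sleep(int(retry_after))
        return get_repository_workflows(repo)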

I thought this was the case as well, but in my testing it consumed far fewer rate-limit resources than making equivalent REST calls to the repository contents API. For example, I was able to pull down all of Microsoft’s public workflows with 15 GraphQL calls (100 repos at a time) in about 10 minutes with a single account.

Interesting, we will check this out!
I am closing this issue for now.

If we decide to switch to GraphQL requests, we should discuss it in another issue.