octokit/app.js

Improving API use efficiency for large apps

rarkins opened this issue · 4 comments

Background:

  • I want my app to "resync" its list of installations + repositories hourly, in case it missed or mishandled any installation webhooks
  • The app has 10k+ installations, hundreds of thousands of repos, so this takes up a LOT of time and API requests

I'm wondering if there are any existing ways to perform such a "sync" more efficiently - whether exposed currently through this library or not.

Starting with the GET /app/installations endpoint, which lists all installations: if we assume that the list of installations changes most hours between syncs, then caching the previous ETag in the hope of getting back a 304 Not Modified won't help much.
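For reference, the ETag approach mentioned above could look roughly like this. It is a minimal sketch, not something this library exposes: `conditionalHeaders` and `applyResponse` are hypothetical helper names, and the `cache` and `response` shapes are assumptions. Only the `If-None-Match` request header and the `304 Not Modified` status come from HTTP itself.

```javascript
// Hypothetical helpers for a conditional GET against /app/installations.
// Assumed shapes:
//   cache    = { etag: string | null, body: any }
//   response = { status: number, headers: { etag?: string }, body: any }

// Build request headers; send If-None-Match only when an ETag is stored.
function conditionalHeaders(cache) {
  const headers = { accept: "application/vnd.github+json" };
  if (cache.etag) headers["if-none-match"] = cache.etag;
  return headers;
}

// Decide what to keep once the response arrives.
// A 304 carries no body, so the cached body is reused unchanged.
function applyResponse(cache, response) {
  if (response.status === 304) {
    return { cache, changed: false };
  }
  return {
    cache: { etag: response.headers.etag ?? null, body: response.body },
    changed: true,
  };
}
```

Each page of a paginated listing would need its own cached ETag, which is part of why this helps so little when the list changes most hours.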

So the next question is: can we somehow get just the ones which were added/removed/changed within the last hour, and not have to paginate 100+ times? The docs list since and outdated parameters but don't link to any documentation for them. The Issues API docs say this about since:

Only show notifications updated after the given time. This is a timestamp in ISO 8601 format: YYYY-MM-DDTHH:MM:SSZ.

So this could maybe be used for finding installations added/modified since the last run, but leaves the problem of synchronizing deleted ones. Perhaps a pragmatic solution would be:

  • Try not to miss or mishandle "installation deleted" webhooks
  • Try to intelligently detect when an installation is returning 404 and check it individually?

Any better ideas to routinely "sync the list of installations for an app"?
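The pragmatic approach above could be sketched as follows. This assumes `since` on `GET /app/installations` returns only installations updated after the given timestamp, which the docs do not confirm. `installationsUrl` and `mergeSync` are hypothetical helpers, and the local store is assumed to be a plain `Map` keyed by installation ID.

```javascript
// Build the list URL with `since`, `per_page` and `page` query parameters.
// Assumption: `since` filters to installations updated after the timestamp.
function installationsUrl(lastSync, page = 1) {
  const url = new URL("https://api.github.com/app/installations");
  // Drop milliseconds to match the documented YYYY-MM-DDTHH:MM:SSZ format.
  const since = lastSync.toISOString().replace(/\.\d{3}Z$/, "Z");
  url.searchParams.set("since", since);
  url.searchParams.set("per_page", "100");
  url.searchParams.set("page", String(page));
  return url.toString();
}

// Merge one sync run into the local store (Map of id -> installation).
//   changed:     installations returned by the `since` query
//   notFoundIds: IDs whose API calls returned 404 since the last run,
//                treated here as deleted after an individual check
function mergeSync(store, changed, notFoundIds) {
  const next = new Map(store);
  for (const installation of changed) next.set(installation.id, installation);
  for (const id of notFoundIds) next.delete(id);
  return next;
}
```

The merge returns a new `Map` rather than mutating the store, so a failed run can be discarded without leaving the local state half-updated.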

Next step: syncing the list of installed repositories.

https://docs.github.com/en/rest/reference/apps#list-repositories-accessible-to-the-app-installation

First question: can we use the results of the earlier "since" query to know whether an installation's list of repositories is unchanged?

If we can't, then is there any way to avoid one GET request per installation?

Then, when we do list the repositories per installation, is there a way to optimize that to reduce transferred data and/or pagination?

The example results in the docs show time stamps in use:

      "pushed_at": "2011-01-26T19:06:43Z",
      "created_at": "2011-01-26T19:01:12Z",
      "updated_at": "2011-01-26T19:14:43Z",

This would mean that the chance that no time stamp changed for any repo is quite low, in which case reusing the previous ETag for that installation doesn't help much.

I also don't see any "since" parameter in this API.

If there were a way to sort results by last updated, then we could at least stop paginating once we see an updated_at time stamp which is earlier than the last time we synced.
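If such a sort order existed, the early stop could be sketched like this. It is hypothetical: the endpoint documents no sort parameter, so `pages` stands in for any source that yields pages of repositories ordered by `updated_at`, newest first, and `collectUpdatedSince` is an illustrative name.

```javascript
// Collect repositories updated at or after `lastSync`, stopping at the
// first older one. Assumes `pages` is an iterable of arrays sorted by
// updated_at descending. Iteration is lazy, so returning early means the
// remaining pages are never requested.
function collectUpdatedSince(pages, lastSync) {
  const cutoff = lastSync.getTime();
  const updated = [];
  for (const page of pages) {
    for (const repo of page) {
      if (Date.parse(repo.updated_at) < cutoff) return updated;
      updated.push(repo);
    }
  }
  return updated;
}
```

Against a real API the page source would be asynchronous, so the outer loop would become `for await` over an async iterator.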

We also have the challenge of working out which repositories were uninstalled.
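Absent a dedicated API, one way to find uninstalled repositories is to compare the stored IDs against a complete fresh listing for the installation. A minimal sketch: `diffRepositoryIds` is a hypothetical helper, and it is only correct when the fresh listing is complete, so it cannot be combined with any early-stopping pagination.

```javascript
// Compare previously stored repository IDs with a complete fresh listing.
// Returns IDs that appeared (added) and IDs that disappeared (removed).
function diffRepositoryIds(knownIds, fetchedIds) {
  const known = new Set(knownIds);
  const fetched = new Set(fetchedIds);
  return {
    added: [...fetched].filter((id) => !known.has(id)),
    removed: [...known].filter((id) => !fetched.has(id)),
  };
}
```

For example, `diffRepositoryIds([1, 2, 3], [2, 3, 4])` reports repository 4 as added and repository 1 as removed.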

gr2m commented

I'm not aware of any better approach with current APIs. A good place to ask would be https://github.community/ or send a message to support at https://support.github.com/contact

I won't be able to dig into it myself, I'm afraid. GitHub didn't extend my contract and gave me only 5 days' notice, so Octokit effectively has no maintainers as of tomorrow 🤷🏼

Thanks, @gr2m. I think that if neither of us is aware of a better way and you are moving on, it's most pragmatic to close this issue. Thanks for the great work you put into these libraries.