[Fixed] Webinstall.dev was down for several minutes.
coolaj86 opened this issue · 2 comments
coolaj86 commented
Problem
17 minutes of downtime today, June 5th, from 18:39 to 18:56 UTC.
Retrospective
What happened:
- There was a typo in the `Authorization` header, so authorization was not correctly sent
- Production makes many requests, quickly reaching the rate limits (which cannot easily be mocked in testing)
- The error was thrown in an async function, which caused the server to restart
- The server refreshes at least one random package on start, which caused failure on start
- Successive failures in rapid succession caused `systemctl` to abort relaunching
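One way a header-name typo could be caught before it reaches production is to build the headers in one place and check the names against a short allowlist. This is only a sketch, not the actual webinstall code: `buildGitHubHeaders`, `assertKnownHeaders`, and the allowlist contents are hypothetical names chosen for illustration.

```javascript
'use strict';

// Hypothetical allowlist of header names this service expects to send.
const ALLOWED_HEADERS = new Set(['accept', 'authorization', 'user-agent']);

// Hypothetical helper: builds request headers for the GitHub REST API.
// The token is optional so that "no token" is an explicit, visible case.
function buildGitHubHeaders(token) {
  const headers = { Accept: 'application/vnd.github+json' };
  if (token) {
    headers.Authorization = `Bearer ${token}`;
  }
  return headers;
}

// Throws on any header name outside the allowlist. A misspelled name is
// still a syntactically valid header, so nothing else would reject it.
function assertKnownHeaders(headers) {
  for (const name of Object.keys(headers)) {
    if (!ALLOWED_HEADERS.has(name.toLowerCase())) {
      throw new Error(`unexpected header name: ${name}`);
    }
  }
  return headers;
}
```

A unit test that calls `assertKnownHeaders(buildGitHubHeaders('some-token'))` would then fail on a misspelled header name without needing to hit the real rate limit.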
What to do about it:
- Fix the typo.
  This passed review without notice. It couldn't reasonably have been caught in testing. The typo was a valid word, so it wasn't caught by spellcheck either. 🤷‍♂️ As humans, we make mistakes.
- Reconsider the error handling.
  Not sure whether this category of error should cause this level of failure. The severity of the failure made it easy to identify and, since a user can't directly invoke this sort of failure remotely, it doesn't seem to present an attack vector.
- More time between hotfixes and refactors.
  "While we're here, might as well..." was the real root cause. It was not necessary to switch to using `fetch` (#852) in order to solve #850, #851. Even though the commits were distinct, the process was not. If I had waited for a truly separate review of that change, the error would have been more likely to be caught (i.e. review fatigue).
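If the error handling is reconsidered, one option is to catch the rejection from the startup refresh so that a failed upstream request is logged rather than taking the process down. This is a sketch under that assumption, not a decision: `refreshSafely` and the `refresh` callback are hypothetical names, and failing hard may still be the preferred behavior.

```javascript
'use strict';

// Hypothetical wrapper: runs an async refresh function and reports failure
// instead of letting the rejection propagate and crash the process.
// `refresh` is any function returning a promise; `log` defaults to console.error.
async function refreshSafely(refresh, log = console.error) {
  try {
    await refresh();
    return true;
  } catch (err) {
    // Log and carry on; the caller can decide whether to retry later.
    log(`refresh failed: ${err.message}`);
    return false;
  }
}

// Example usage at startup (refreshRandomPackage is a placeholder name):
//   refreshSafely(() => refreshRandomPackage());
```

The trade-off is the one noted above: a loud crash made this incident easy to spot, so a softer failure would need its own alerting to stay visible.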
Status Updates
Not sure why yet. Investigating.
Possibly related to the change in fetching github releases and a difference between the staging and production environment.
coolaj86 commented
{ "message": "API rate limit exceeded for 128.199.9.106. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)", "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting" }
This is unexpected, as the adjacent logs also indicate the username, which implies that the token was being used.
Also strange that a restart of the server "fixed" it.
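A quick way to tell whether a given response was actually treated as authenticated is to look at the `x-ratelimit-*` response headers GitHub sends. The sketch below assumes the documented unauthenticated limit of 60 requests per hour, which may change; `describeRateLimit` is a hypothetical helper and takes a plain object with lower-cased header names.

```javascript
'use strict';

// Assumption: GitHub's documented hourly limit for unauthenticated requests.
const UNAUTHENTICATED_LIMIT = 60;

// Hypothetical helper: summarizes the rate-limit headers of one response.
// `headers` is a plain object keyed by lower-cased header name.
function describeRateLimit(headers) {
  const limit = Number(headers['x-ratelimit-limit']);
  const remaining = Number(headers['x-ratelimit-remaining']);
  // x-ratelimit-reset is expressed in UTC epoch seconds.
  const resetsAt = new Date(Number(headers['x-ratelimit-reset']) * 1000);
  return {
    limit,
    remaining,
    resetsAt,
    // A limit at or below the unauthenticated ceiling suggests the token
    // was not recognized, whatever the local logs say was sent.
    likelyUnauthenticated: limit <= UNAUTHENTICATED_LIMIT,
  };
}
```

With `fetch`, the input could be built with `Object.fromEntries(response.headers)`, since the `Headers` object iterates lower-cased names.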
coolaj86 commented
Typo in the authorization header.