atom/teletype-server

Enable auto-scaling on Heroku

as-cii opened this issue · 8 comments

I have enabled auto-scaling on atom-tachyon by:

  • Switching to a Performance-M dyno (https://devcenter.heroku.com/articles/dyno-types#available-dyno-types).
  • Setting a minimum of 1 dyno.
  • Setting a maximum of 4 dynos.
  • Setting the scaling strategy to target a 500ms response time at the 95th percentile.
  • Enabling e-mail notifications for when dynos can't be scaled up because doing so would exceed the aforementioned maximum.

[Screenshot: Heroku auto-scaling settings for atom-tachyon, taken 2017-11-03]
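The autoscaling policy itself (minimum/maximum dynos and the p95 target) lives in the Heroku dashboard shown above, but for reference the resulting dyno formation can be checked programmatically through the Heroku Platform API. The snippet below is a minimal sketch, assuming Node 18+ (for the global fetch) and a HEROKU_API_TOKEN environment variable; it only reports the current dyno type, count, and size, not the autoscaling settings.

```js
// Minimal sketch: print the current dyno formation for atom-tachyon via the
// Heroku Platform API. Assumes Node 18+ (global fetch) and a HEROKU_API_TOKEN
// environment variable. The autoscaling policy itself is not exposed here.
const APP = 'atom-tachyon'

async function printFormation () {
  const response = await fetch(`https://api.heroku.com/apps/${APP}/formation`, {
    headers: {
      Accept: 'application/vnd.heroku+json; version=3',
      Authorization: `Bearer ${process.env.HEROKU_API_TOKEN}`
    }
  })
  if (!response.ok) throw new Error(`Heroku API returned ${response.status}`)

  for (const dyno of await response.json()) {
    // e.g. "web: 1 × Performance-M"
    console.log(`${dyno.type}: ${dyno.quantity} × ${dyno.size}`)
  }
}

printFormation().catch(console.error)
```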

/cc: @jasonrudolph @nathansobo

Please note that I updated the scaling strategy to target a 500ms response time (instead of 300ms) at the 95th percentile, because we were already getting e-mails about not having enough dynos (even with super low traffic).

We've received two more of these emails after updating the scaling strategy. We got one email on Nov 4 (Saturday) and one today. @as-cii: Do you have any thoughts regarding adjustments that we should make?

I think we have two options (not necessarily mutually exclusive):

  • Allowing a higher maximum number of dynos. I am not sold on this yet, because it seems very unlikely that we need more than 4 dynos at the moment.
  • Increasing the 95th percentile response time threshold. I think it might be okay for a small portion of users to wait slightly longer to create or join a portal. Maybe 1s could be an acceptable threshold.

Happy to discuss this further once you are around later today.

To mitigate this, we've raised the 95th percentile response time threshold to 750ms and made the /_ping endpoint check the status of its services in parallel. We'll find out if these two actions are sufficient as we see more traffic.
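For context, here's a rough sketch of what the parallel check looks like; the Express wiring and per-service check functions below are stand-ins, not the actual teletype-server code. The point is that sequential awaits become a single Promise.all, so the endpoint's latency is bounded by the slowest dependency instead of by the sum of all of them.

```js
const express = require('express')
const app = express()

// Hypothetical per-service checks, standing in for the real Postgres, Twilio,
// and GitHub API probes; each returns a promise that resolves once the
// corresponding service responds.
const checkPostgres = async () => { /* e.g. run `SELECT 1` against the database */ }
const checkTwilio = async () => { /* e.g. fetch the Twilio account status */ }
const checkGitHubAPI = async () => { /* e.g. hit a lightweight GitHub API endpoint */ }

app.get('/_ping', async (req, res) => {
  try {
    // Run the checks concurrently so the response time is bounded by the
    // slowest service rather than by the sum of all three.
    await Promise.all([checkPostgres(), checkTwilio(), checkGitHubAPI()])
    res.status(200).send('pong')
  } catch (error) {
    res.status(503).send(error.message)
  }
})

app.listen(process.env.PORT || 8080)
```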

I'm pretty confused about why our response time wouldn't be lightning fast in the current setup. We barely do anything on these code paths. 😕

@nathansobo: @as-cii and I looked at the New Relic dashboard today to try to get some insight. Based on that research, I'm pretty sure the slow responses that we saw over the weekend were due to the GET /identity endpoint getting a slow response from the GitHub API.

For requests that only need to access the PostgreSQL database, I expect a lightning fast response.

For requests that hit Twilio or the GitHub API, a delayed response from those services will result in a delayed response from teletype-server.
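To make that dependency concrete, here's an illustrative sketch of such a handler; the route shape, header name, and GitHub call are assumptions for illustration (not the actual teletype-server implementation), and it assumes Node 18+ for the global fetch.

```js
const express = require('express')
const app = express()

// Illustrative GET /identity handler: it has to ask the GitHub API who the
// OAuth token belongs to before it can respond.
app.get('/identity', async (req, res) => {
  try {
    // This await sits directly on the request path, so if the GitHub API takes
    // 900ms to answer, this endpoint takes at least 900ms, no matter how fast
    // our own code and database are.
    const response = await fetch('https://api.github.com/user', {
      headers: {Authorization: `token ${req.headers['github-oauth-token']}`}
    })
    if (!response.ok) return res.status(response.status).end()

    const user = await response.json()
    res.json({id: user.id, login: user.login})
  } catch (error) {
    res.status(502).send(error.message)
  }
})

app.listen(process.env.PORT || 8080)
```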

So it sounds like we're I/O bound. I have very little context so this might be obvious or off the mark, but we should be careful about scaling up dynos in response to slow response times.

Agreed, that's why we've cranked the 95th percentile response time threshold up to 1s. In other words, Heroku will start scaling up only when the 95th percentile response time exceeds one second, and it will scale back down shortly afterward if it realizes that it was just a temporary spike.
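For anyone skimming: the 95th percentile response time is the value below which 95% of requests complete, so a handful of slow GitHub or Twilio round trips can push it over the threshold while the median stays tiny. A quick sketch of the arithmetic (nearest-rank percentile; this is for intuition only, not Heroku's actual measurement window or algorithm):

```js
// Nearest-rank percentile over a window of response times (in milliseconds).
function percentile (values, p) {
  const sorted = [...values].sort((a, b) => a - b)
  const index = Math.ceil((p / 100) * sorted.length) - 1
  return sorted[Math.max(0, index)]
}

// 94 fast requests plus 6 slow, upstream-bound ones: the median stays low,
// but the p95 lands in the slow tail, past the 1s autoscaling threshold.
const samples = [
  ...Array.from({length: 94}, () => 40 + Math.random() * 60), // 40-100ms
  1200, 1350, 1500, 1650, 1800, 2100                          // slow outliers
]
console.log('p50:', Math.round(percentile(samples, 50)), 'ms')
console.log('p95:', Math.round(percentile(samples, 95)), 'ms')
```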

@jasonrudolph and I were discussing how suboptimal it is that the 95th percentile response time is the only parameter auto-scaling can be based on, but I guess we'll have to live with it for now. Post-launch we can figure out whether this solution still fits us, or whether we should look into something like http://hirefire.io:

For web-based dynos we support the following metrics types:

  • Response Time (percentile, average)
  • Connect Time (percentile, average)
  • Dyno Load (average)
  • Requests Per Minute
  • Apdex Score
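Of those, the Apdex Score is the only one that isn't a raw latency or throughput number, so for reference: the standard formula buckets requests against a target threshold T (satisfied means ≤ T, tolerating means ≤ 4T, frustrated otherwise) and scores (satisfied + tolerating / 2) / total. A small sketch:

```js
// Standard Apdex calculation for a target threshold T (in milliseconds):
// satisfied = response time <= T, tolerating = <= 4T, frustrated = > 4T.
function apdex (responseTimesMs, targetMs) {
  const satisfied = responseTimesMs.filter(t => t <= targetMs).length
  const tolerating = responseTimesMs.filter(t => t > targetMs && t <= 4 * targetMs).length
  return (satisfied + tolerating / 2) / responseTimesMs.length
}

// Example with a 500ms target: 90 satisfied, 6 tolerating, 4 frustrated.
const times = [
  ...Array(90).fill(120),  // well under the 500ms target
  ...Array(6).fill(1500),  // between 500ms and 2000ms (4 × target)
  ...Array(4).fill(3000)   // beyond 4 × target
]
console.log(apdex(times, 500)) // (90 + 6 / 2) / 100 = 0.93
```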