mozilla-services/contile

Investigate contile 1.9.0 load test results

data-sync-user opened this issue · 7 comments

The latest load test against 1.9.0 isn’t passing compared to a rerun of 1.8.1:

1.8.1:

!Screen Shot 2022-11-03 at 4.15.12 PM.png|width=602,height=296!

1.9.0:

!Screen Shot 2022-11-03 at 4.15.20 PM.png|width=600,height=304!

There’s a constant rate of 502s (as opposed to none in 1.8.1), followed by 3 large spikes of them.

Load Test History: https://docs.google.com/document/d/10Hx4cGvGBvq0z0uOK_CG3ZcyaQnT_EtgR6MYXmIvG6Q/edit


➤ Philip Jenvey commented:

The LB reports numerous backend_connection_closed_before_data_sent_to_client statuses during the steady stream of 502s, along with some failed_to_pick_backend statuses, then mostly failed_to_pick_backend during the 2 steep dropoffs.
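
For reference, the affected requests can be pulled out of Cloud Logging with a filter along these lines -- this assumes the LB is a GCP external HTTP(S) load balancer, which records these codes under jsonPayload.statusDetails; the exact resource labels depend on the setup:

```
resource.type="http_load_balancer"
httpRequest.status=502
jsonPayload.statusDetails=("backend_connection_closed_before_data_sent_to_client" OR "failed_to_pick_backend")
```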

➤ Philip Jenvey commented:

Despite the numerous 502 errors, contile 1.9.0 is handling more requests per second than 1.8.1 -- around 2.1k vs 1.7k (w/ one pod) -- while also using a little less CPU.

However, nginx uses more CPU to keep up, exceeding its k8s request of 400m (with no limit). Here’s a 1.8.1 run on the left vs a 1.9.0 run on the right:

!Screen Shot 2022-11-10 at 6.08.50 PM.png|width=884,height=303!

The large dropoff during the load test is due to nginx's liveness heartbeat (the /nginxheartbeat endpoint) failing, triggering k8s to restart it. Its response times also seemed to begin degrading shortly before that, whereas contile's didn't.
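
For context, the liveness check is roughly this shape -- a hypothetical sketch, not the actual cloudops-infra manifest, so the path, port, timings, and thresholds here are all assumptions:

```yaml
# Hypothetical sketch of the nginx liveness probe; the real manifest in
# cloudops-infra may use a different path, port, and thresholds.
livenessProbe:
  httpGet:
    path: /nginxheartbeat   # assumed to match the heartbeat endpoint mentioned above
    port: 8080              # assumed container port
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3       # a few consecutive failures and k8s restarts the container
```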

So I think nginx needs more compute here (or needs to do less work). Next steps:

  • I noticed contile uses the default actix-web keep-alive (5s); let's try setting this to 630s (on par w/ nginx's), in hopes it reduces nginx's workload (see the sketch after this list)
  • If that doesn't help, I also noticed stage is using n1-highcpu-2 instances whereas prod uses n1-highcpu-8 -- these should probably match anyway, and matching would give nginx more room
    • Then I think we'll need to bump its cpu request anyway to be more accurate
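
A minimal sketch of the keep-alive change from the first bullet, assuming actix-web 4 (where a Duration converts into KeepAlive); contile's real server setup and settings live elsewhere, so treat the structure and port here as illustrative only:

```rust
use std::time::Duration;

use actix_web::{web, App, HttpResponse, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new().route("/", web::get().to(|| async { HttpResponse::Ok().finish() }))
    })
    // Default keep-alive is 5s; hold idle connections open for 630s
    // (matching the nginx side) so nginx can reuse upstream connections
    // instead of constantly re-establishing them.
    .keep_alive(Duration::from_secs(630))
    .bind(("0.0.0.0", 8000))?
    .run()
    .await
}
```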

It’s not clear to me why no autoscaling kicked in though (does stage not autoscale?).
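
For what it's worth, CPU-based autoscaling in k8s would normally come from a HorizontalPodAutoscaler along these lines -- a generic example, not the actual stage config, which may not define one at all:

```yaml
# Generic CPU-based HPA example; not taken from the stage deployment,
# which may or may not define an autoscaler at all.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: contile
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: contile
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # scale out when pods average >80% of their CPU request
```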

➤ Philip Jenvey commented:

Additional context: combined 1.9.0 nginx+contile CPU utilization just before the 502 spike:

!Screen Shot 2022-11-15 at 4.40.25 PM.png|width=909,height=306!

vs 1.8.1’s typical utilization:

!Screen Shot 2022-11-15 at 4.43.00 PM.png|width=618,height=305!

➤ Philip Jenvey commented:

Bumping the CPU (cloudops-infra#4504: https://github.com/mozilla-services/cloudops-infra/pull/4504) has solved this:

!Screen Shot 2022-11-29 at 1.10.14 PM.png|width=600,height=296!

1.9.0 does around 3.1k req/second with this adjustment, which also bumped nginx’s cpu request from 400m → 500m; however, both nginx and now contile are using more than their requests:

!Screen Shot 2022-11-29 at 1.14.51 PM.png|width=678,height=287!

CC Dustin Lactin, I think this suggests we should bump both cpu requests further, e.g. nginx 500m → 850m, etc.: https://github.com/mozilla-services/cloudops-infra/compare/contile-cpu-request
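
For reference, the change under discussion is just the resources stanza on the containers -- a hypothetical sketch, where only the 850m figure comes from the comment above and the rest is illustrative:

```yaml
# Hypothetical sketch of the nginx container's resources; only the 850m
# request comes from the proposal above, the rest is illustrative.
resources:
  requests:
    cpu: 850m   # raised from 500m so scheduling reflects actual usage under load
  # no cpu limit: the container can still burst above its request and use
  # whatever CPU is free on the node
```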

➤ Dustin Lactin commented:

Philip Jenvey I think increasing the requests is a good idea, but as long as there are no limits set, the pod will consume all the CPU available to it.

Looking at the workload metrics, combined CPU usage was right around the 2-core capacity of the old node type (n1-highcpu-2) during load testing.

!Screen Shot 2022-11-29 at 4.05.31 PM.png|width=1202,height=428!

With the new (n1-highcpu-4) node type, there should be some excess capacity.

➤ Janet Dragojevic commented:

Philip Jenvey do you have an update for this issue?

➤ Philip Jenvey commented:

With the cpu request change applied earlier, this issue is now complete.

I closed the WIP keep-alive PR (https://github.com/mozilla-services/cloudops-infra/pull/4496) but logged https://mozilla-hub.atlassian.net/browse/DISCO-2148 for revisiting it later.

Then we successfully rolled out 1.9.0 to stage/production this afternoon.