juice-shop/multi-juicer

K8s autoscaling - balancer timeouts

dmspils opened this issue · 1 comment

I am running multi-juicer on Google Kubernetes Engine (GKE) and have spotted a bug when running large events.

Spinning up a new Juice Shop container on a node that has spare capacity takes around 22s. But when the node pool is at full capacity and you spin up a new container, it takes about 1m30s for a new node to become available and the container to launch. In that time, the balancer times out with:

GET https://training.test.appsec.tools.bbc.co.uk/balancer/teams/7tguy/wait-till-ready 502
Failed to wait for deployment readiness

You can still log out of the balancer and log back in to your team with your password, but you have to know to do that, and in large team events the unlucky few who hit this bug don't.

To replicate:

  1. Operate a GKE cluster that is sized appropriately for multi-juicer, i.e. the default node pool is running with very little spare CPU capacity.
  2. Launch another multi-juicer team (juice-shop container); this will force the cluster to autoscale and add a new node.
  3. Look at the multi-juicer balancer with dev tools open; errors will be reported after about 1m.
  4. Watch the cluster resources page in GKE. The new node and associated juice-shop container should take about 1m30s to become operational.
  5. Back in the multi-juicer balancer with dev tools open, the page will fail to refresh/reload and will stay stuck on the page saying "Starting a new Juice Shop Instance" with the spinner spinning indefinitely (a small timing sketch follows this list).
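For anyone who wants to quantify the gap between the balancer giving up (~1m) and the new node actually serving the container (~1m30s), a minimal polling sketch like the one below can help. It assumes Node 18+ (built-in fetch), takes the host and team name from the error in the report, and assumes the wait-till-ready endpoint answers with a 2xx once the instance is up; cookies/auth are ignored, so it may need to be run from a browser console on the balancer page instead.

```typescript
// Hypothetical reproduction helper: poll wait-till-ready and report how long
// it takes to succeed. Node 18+ (global fetch) assumed.
const BALANCER = "https://training.test.appsec.tools.bbc.co.uk"; // host from the report
const TEAM = "7tguy";                                            // team name from the report

async function timeUntilReady(maxWaitMs = 5 * 60 * 1000, intervalMs = 5_000): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < maxWaitMs) {
    const res = await fetch(`${BALANCER}/balancer/teams/${TEAM}/wait-till-ready`);
    const elapsedS = Math.round((Date.now() - start) / 1000);
    console.log(`${elapsedS}s: HTTP ${res.status}`);
    if (res.ok) {
      console.log(`Instance became ready after ~${elapsedS}s`);
      return;
    }
    // A 502 here corresponds to "Failed to wait for deployment readiness".
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  console.log(`No success within ${Math.round(maxWaitMs / 1000)}s`);
}

timeUntilReady().catch(console.error);
```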

Oh that's an interesting case.
I've seen timeouts happen before and then increased the timeout duration from 1 to 3 minutes. The timeouts are hard coded at the moment, so this should also apply to your stack. Is it possible that the GKE load balancer has a default timeout of a minute?
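For context, the balancer-side shape of the problem is roughly the sketch below: the wait-till-ready route polls for deployment readiness under a fixed overall deadline and answers 502 once that deadline passes. The route path matches the one in the report, but the Express wiring, the READINESS_TIMEOUT_MS value and the isDeploymentReady helper are assumptions for illustration, not the actual JuiceBalancer code.

```typescript
import express from "express";

const app = express();

// Assumed overall deadline; the real value is hard coded in the balancer
// and was reportedly raised from 1 to 3 minutes.
const READINESS_TIMEOUT_MS = 3 * 60 * 1000;
const POLL_INTERVAL_MS = 1_000;

// Hypothetical helper: in the real balancer this would ask the Kubernetes API
// whether the team's juice-shop deployment has ready replicas.
async function isDeploymentReady(team: string): Promise<boolean> {
  return false; // placeholder
}

app.get("/balancer/teams/:team/wait-till-ready", async (req, res) => {
  const deadline = Date.now() + READINESS_TIMEOUT_MS;
  while (Date.now() < deadline) {
    if (await isDeploymentReady(req.params.team)) {
      return res.sendStatus(200);
    }
    await new Promise((r) => setTimeout(r, POLL_INTERVAL_MS));
  }
  // If a new GKE node takes longer than the deadline, the client sees this 502.
  return res.status(502).send("Failed to wait for deployment readiness");
});

app.listen(8080); // port is illustrative
```

If an upstream load balancer (e.g. the GKE one) closes the long-running request before that deadline, the client would see the failure even earlier than the hard-coded timeout.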

Anyway, the timeouts in the JuiceBalancer are still not handled very cleanly, especially on the frontend side. It's good that you're opening an issue for this; I will try to address it soon.
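One way the frontend could be made more forgiving is sketched below, under the assumption that the UI polls wait-till-ready with fetch: tolerate 502s and network errors and keep retrying within an overall budget that comfortably exceeds node autoscaling time, instead of giving up on the first failure. The waitForInstance name and the budget value are illustrative, not the actual balancer UI code.

```typescript
// Hypothetical frontend helper: keep asking the balancer whether the team's
// instance is ready, tolerating 502s while a new node is still coming up.
async function waitForInstance(team: string, budgetMs = 5 * 60 * 1000): Promise<boolean> {
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`/balancer/teams/${team}/wait-till-ready`);
      if (res.ok) {
        return true; // instance is ready; redirect the user to their Juice Shop
      }
      // 502 while the cluster is autoscaling: wait and try again instead of giving up.
    } catch {
      // Network error / connection cut by an intermediate load balancer: also retry.
    }
    await new Promise((r) => setTimeout(r, 3_000));
  }
  return false; // surface a real error only after the whole budget is spent
}
```

With something like this, the "Starting a new Juice Shop Instance" spinner would resolve on its own once the new node is up, instead of requiring the log-out/log-in workaround described above.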