crossplane/upjet

Upjet providers can't consume the workqueue fast enough, causing a huge time-to-readiness delay


Cross-posting about crossplane/terrajet#300, because the exact same behavior happens with Upjet, and as far as I understand, terrajet will be deprecated in favor of Upjet, so it makes sense to track this issue here.

This is especially relevant as Upjet now seems to be the "official" backend for provider implementations.

What happened?

The expected behaviour is that an upjet resource's time-to-readiness wouldn't depend on the number of resources that already exist in the cluster.

In reality, since each terraform call takes a while (around 1 second in my tests), the provider controller is unable to clear its work queue. Because of that, any new event (such as creating a new resource) takes very long to complete when many other resources already exist, since the controller adds the new event to the end of the queue.

There are more details in the original bug report.
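
To see the backlog directly, it's possible to look at the controller-runtime workqueue metrics of the provider pod. A minimal sketch, assuming the provider exposes the default controller-runtime metrics endpoint on port 8080 (the pod name, namespace, and port may differ in your deployment):

# Hypothetical sketch: adjust namespace, pod selection, and metrics port to your setup
POD=$(kubectl get pods -n crossplane-system -o name | grep provider-aws | head -n1)
kubectl port-forward -n crossplane-system "$POD" 8080:8080 &
sleep 2
# workqueue_depth reports how many reconcile requests are currently waiting per controller
curl -s localhost:8080/metrics | grep '^workqueue_depth'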


How can we reproduce it?

The reproduction steps are basically the same as in the original issue, just swapping the terrajet provider for the upjet provider.

  • Create a new kubernetes cluster (with kind or in the cloud).
  • Install crossplane
  • Install an upjet provider (I'll use AWS because it can be easily compared with the native provider-aws)
  • Create a handful of resources to be managed by upjet (I'll use IAM Policy because it is created quickly and doesn't incur costs)
  • Wait until all resources are created and ready
    it will take some minutes, but a burst of resources is expected to take a bit.
    Although it does take much longer than provider-aws for the same resources.
  • Create one more resource to be managed by upjet

The last step will take a long time, which is the problem this bug report is about.

Reproducible commands:
# Create kind cluster
kind create cluster --name upjet-load --image kindest/node:v1.23.10

# Install crossplane with helm
kubectl create namespace crossplane-system
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane

# Install AWS upjet provider
kubectl apply -f - <<YAML
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: crossplane-provider-jet-aws
spec:
  controllerConfigRef:
    name: config
  package: xpkg.upbound.io/upbound/provider-aws:v0.18.0
---
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: config
spec:
  args:
  - --debug
YAML
## Use your AWS credentials in the secret; if you have a custom way to interact with AWS, change the credentials key of this secret.
kubectl create secret generic -n crossplane-system --from-file=credentials=$HOME/.aws/credentials aws-credentials
kubectl apply -f - <<YAML
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      name: aws-credentials
      namespace: crossplane-system
      key: credentials
YAML

# Create a handful of resources managed by upjet. I chose policies because they are created near-instantly in the cloud and don't incur costs
## Note: seq on MacOS doesn't seem to support the -w flag, it can be removed safely below
for n in $(seq -w 200); do
  echo "---"
  sed "s/NUMBER/$n/" <<YAML
apiVersion: iam.aws.upbound.io/v1beta1
kind: Policy
metadata:
  name: upboundNUMBER
spec:
  providerConfigRef:
    name: default
  forProvider:
    policy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:GetPolicy",
            "Resource": "*"
          }
        ]
      }
YAML
done | kubectl apply -f -

# Wait until all are ready; it took about 15 minutes for me
kubectl get policies.iam.aws.upbound.io

# Create one more resource
kubectl apply -f - <<YAML
apiVersion: iam.aws.upbound.io/v1beta1
kind: Policy
metadata:
  name: upbound-slow
spec:
  providerConfigRef:
    name: default
  forProvider:
    policy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:GetPolicy",
            "Resource": "*"
          }
        ]
      }
YAML

# It takes a long time for the resource to become ready.
kubectl get policies.iam.aws.upbound.io upbound-slow

########

# Cleanup cloud resources
kubectl delete policies.iam.aws.upbound.io $(kubectl get policies.iam.aws.upbound.io | grep upbound | awk '{print $1}')

# Delete kind cluster
kind delete cluster --name upjet-load

Thank you for the detailed report @Kasama; we'll be taking a look at this in our next sprint, starting next week.

Great to hear that! Feel free to reach out either here or on Crossplane's Slack (@roberto.alegro) if I can help with more details or reproduction steps.

Thanks a lot for your detailed analysis here @Kasama.

I believe a low-hanging fruit here is to set reasonable defaults for the MaxConcurrentReconciles and PollInterval configurations to cover common cases, and then to ensure they are exposed as configuration parameters so that they can be tuned further for specific deployments.
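
For example, once exposed, tuning them per deployment could look roughly like the ControllerConfig below. The --max-reconcile-rate and --poll flag names are assumptions based on how crossplane-runtime-based providers usually expose these knobs, so please check your provider version's --help for the actual flags:

kubectl apply -f - <<YAML
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: config
spec:
  args:
  - --debug
  # Assumed flag names; verify against the provider binary's --help output
  - --max-reconcile-rate=10   # roughly maps to MaxConcurrentReconciles
  - --poll=10m                # roughly maps to PollInterval
YAML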

muvaf commented

FYI, #99 is another thing that may cause CPU saturation.

On a GKE cluster with the e2-standard-4 machine type, I ran the following 3 experiments:

  • Experiment 1: With maxConcurrentReconciles=1 and pollInterval=1m (Current defaults)
  • Experiment 2: With maxConcurrentReconciles=10 and pollInterval=1m (Community providers defaults)
  • Experiment 3: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults)

There are definitely some improvements between Exp#1 and Exp#2, but TBH I am a bit surprised that Exp#3 is not much different from Exp#2. I am wondering whether this could be related to the CPU being throttled in both cases. I am planning to repeat the two experiments with larger nodes to avoid throttling.
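
One way to rule out CPU throttling without moving to larger nodes could be raising (or removing) the CPU limit of the provider pod through the ControllerConfig. A rough sketch, assuming this ControllerConfig version exposes the pod resources field; the values below are placeholders:

kubectl apply -f - <<YAML
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: config
spec:
  # Assumption: spec.resources is supported by this ControllerConfig version
  resources:
    requests:
      cpu: "2"
      memory: 1Gi
    limits:
      cpu: "4"
      memory: 2Gi
YAML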

Experiment 1: With maxConcurrentReconciles=1 and pollInterval=1m (Current defaults)

Provisioned 100 ECR Repositories and it took ~19 mins until all of them became Ready.
Wanted to delete 1 of them; it took ~9 min until it was processed the first time and ~18 min until it was deleted.


Experiment 2: With maxConcurrentReconciles=10 and pollInterval=1m (Community defaults)

Provisioned 100 ECR Repositories and it took ~12 mins until all of them became Ready.
Wanted to delete 1 of them; it took ~1 min until it was processed the first time and ~5 mins until it was deleted.


Experiment 3: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults)

Provisioned 100 ECR Repositories and it took ~12 mins until all of them became Ready.
Wanted to delete 1 of them; it took ~1 min until it was processed the first time and ~4 mins until it was deleted.


Yeah, during my testing I walked a similar path: I had changed pollInterval but assumed it didn't have an impact. Maybe an even bigger interval, like 30m or 1h, would yield different results, but that starts to become a bit unreasonable IMO.

Indeed there are some improvements when bumping the concurrency, but sadly the problem remains: the time it takes for new resources to become ready still depends heavily on the number of resources that already exist.

I repeated the last experiment on a bigger node (e2-standard-32) to eliminate the effect of CPU throttling, and this time it looks much better (except for the resource consumption).

Experiment 4: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults) (on e2-standard-32)

Provisioned 100 ECR Repositories and it took ~2 mins until all of them became Ready.
Wanted to delete 1 of them; it took ~5 secs until it was processed the first time and ~10 secs until it was deleted.


I believe improving resource usage is orthogonal to the settings discussed here, and I feel good about the above defaults while still exposing them as configurable params.

I'll open PRs with proposed defaults.

I was finally able to do some more tests using a bigger instance (an m6i.8xlarge on AWS in my case, similar to the e2-standard-32 you used), and I can confirm that running on a bigger instance does make the queue clear faster with the new defaults.

But when trying with ~5000 concurrent resources, there was still a similar problem: the queue held around 700 items at all times. That can again be mitigated by increasing the reconciliation interval, but it would be much better to have a way to scale these controllers horizontally.