Terrajet providers take a long time on cluster with many resources
Kasama opened this issue · 2 comments
Hello team!
Me and my team are developing a solution for developers to be able to manage infrastructure resources through an API. This solution relies on crossplane to manage cloud resources as kubernetes CRs.
We've started using terrajet for AWS, but changed to the native provider because of this issue. New we also need to manage other resources for which there are no native providers (namely vault, boundary and datadog) and it would be great to use terrajet.
Existing Context
initial slack thread with problem description
What happened?
The expected behaviour is that terrajet resources time-to-readiness wouldn't depend on the amount of resources that already exist in the cluster.
In reality, the terrajet controller takes roughly 3
* (number of existing resources)
* (time it takes for one resource to be reconciled)
seconds to consider a new resource ready.
The examples in this issue are using provider-jet-aws
, but all terrajet providers have the same behaviour, from what I can see.
Below are the debug logs related to one recently created policy (terrajetloadtestslow
). It was created on a provider that is already managing 200 other aws policies.
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630252994777613e+09 DEBUG provider-jet-aws Reconciling {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253028353422e+09 DEBUG provider-jet-aws refresh ended {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:20.558331Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state...\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.822179Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.822992Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null}},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Drift detected (delete)\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.823374Z\",\"change\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"delete\"},\"type\":\"resource_drift\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.823487Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.829034Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.829081Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253059385006e+09 DEBUG provider-jet-aws apply ended {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.897931Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Plan to create\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.127798Z\",\"change\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\"},\"type\":\"planned_change\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 1 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.127975Z\",\"changes\":{\"add\":1,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Creating...\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.630013Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\"},\"type\":\"apply_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Creation complete after 0s [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.924980Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\",\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\",\"elapsed_seconds\":0},\"type\":\"apply_complete\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 1 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.932091Z\",\"changes\":{\"add\":1,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.932195Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.663025305947192e+09 DEBUG provider-jet-aws Successfully requested creation of external resource {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573953565", "external-name": "", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253059475644e+09 DEBUG events Normal {"object": {"kind":"Policy","name":"terrajetloadtestslow","uid":"636bbd3e-94bb-40ae-acde-f866e095b0e6","apiVersion":"iam.aws.jet.crossplane.io/v1alpha2","resourceVersion":"573965831"}, "reason": "CreatedExternalResource", "message": "Successfully requested creation of external resource"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.663025826864625e+09 DEBUG provider-jet-aws Reconciling {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630258282078466e+09 DEBUG provider-jet-aws refresh ended {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:06.895864Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state... [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.112383Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.197666Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.198356Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.203956Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.203989Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630258282080562e+09 DEBUG provider-jet-aws External resource is up to date {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573965832", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow", "requeue-after": 1663025888.2080543}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263118990378e+09 DEBUG provider-jet-aws Reconciling {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263132372632e+09 DEBUG provider-jet-aws refresh ended {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:11.952075Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state... [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.135934Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.226661Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.227599Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.232695Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.232734Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263132466455e+09 DEBUG provider-jet-aws External resource is up to date {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573976982", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow", "requeue-after": 1663026373.2466433}
From these logs we can also see the time it took for the resource to be Ready
. From 23:28:19 to 23:45:13, ~17 minutes. (using GMT time to make it easier)
It should also be noted that the resource is created in the cloud provider at 23:28:25 (12 seconds), when the logs say Successfully requested creation of external resource
, but it takes a long time until the resource is marked as Ready
in kubernetes
The reason why it takes so long seems to be that resource change events are not given any priority over the normal reconciliation loop. The controller stays in an endless loop reconciling every resource it knows about, which is necessary to override external changes in the providers, but any resource that is added or changed in kubernetes doesn't get immediately picked up. Rather it seems to be added to the queue and only seen after all existing reconcile loops are done, then the resource is picked up, and a creation flow starts, but the result will only be seen on the next reconciliation loop. Once that comes, terrajet will pick up the resource's status from the terraform state and requeue once more, only after this last requeue the status will be updated to Ready
.
It's also worth noting that changing the MaxConcurrentReconciles configuration to something bigger, like 5 or 10 reduces the time by that factor, but the underlying behavior persists and all resources are reconciled before considering new resource or updates Ready
.
What I'm having trouble understanding is how this terrajet
workflow differs from the native provider-aws
one, as both seem to use the same Reconcile function from crossplane-runtime.
Also, the UseAsync
configuration for a resource doesn't seem to change anything. I've tried both for policies.iam
and for repositories.ecr
with the same behavior.
How can we reproduce it?
- Create a new kubernetes cluster (with kind or in the cloud).
- Install crossplane
- Install a terrajet provider (I'll use AWS because it can be easily compared with the native
provider-aws
) - Create a handful of resources to be managed by terrajet (I'll use ECR because it is quickly created and dont incur costs)
- Wait until all resources are created and ready
it will take some minutes, but a burst of resources is expected to take a bit.
Althought it does take much longer thanprovider-aws
for the same resource. - Create one more resource to be managed by terrajet
The last step will take a long time, which is the problem this bug report is about.
Open collapsible for reproducible commands
# Create kind cluster
kind create cluster --name terrajet-load --image kindest/node:v1.23.10
# Install crossplane with helm
kubectl create namespace crossplane-system
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane
# Install AWS terrajet provider
kubectl apply -f - <<YAML
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
name: crossplane-provider-jet-aws
spec:
controllerConfigRef:
name: config
package: crossplane/provider-jet-aws:v0.5.0
---
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
name: config
spec:
args:
- --debug
YAML
## use your aws credentials in the secret, if you have a custom way to interact with AWS, change the credentials key of this secret.
kubectl create secret generic -n crossplane-system --from-file=credentials=$HOME/.aws/credentials aws-credentials
kubectl apply -f - <<YAML
apiVersion: aws.jet.crossplane.io/v1alpha1
kind: ProviderConfig
metadata:
finalizers:
- in-use.crossplane.io
name: default
spec:
credentials:
secretRef:
key: credentials
name: aws-credentials
namespace: crossplane-system
source: Secret
YAML
# Create a handful of resources managed by terrajet. I chose policies because they are created near-instantly in the cloud and don't incur costs
## Note: seq on MacOS doesn't seem to support the -w flag, it can be removed safely below
for n in $(seq -w 200); do
echo "---"
sed "s/NUMBER/$n/" <<YAML
apiVersion: iam.aws.jet.crossplane.io/v1alpha2
kind: Policy
metadata:
name: terrajetloadtestNUMBER
spec:
forProvider:
name: terrajetloadtestNUMBER
policy: |
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": "*",
"Resource": ["*"]
}
]
}
providerConfigRef:
name: default
YAML
done | kubectl apply -f -
# Wait until all are ready, it took about 15 minutes for me
kubectl get policies.iam.aws.jet.crossplane.io
# Create one more resource
kubectl apply -f - <<YAML
apiVersion: iam.aws.jet.crossplane.io/v1alpha2
kind: Policy
metadata:
name: terrajetloadtestslow
spec:
forProvider:
name: terrajetloadtestslow
policy: |
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": "*",
"Resource": ["*"]
}
]
}
providerConfigRef:
name: default
YAML
# It takes a long time for the resource to become ready.
kubectl get policies.iam.aws.jet.crossplane.io terrajetloadtestslow
########
# Cleanup cloud resources
kubectl delete policy.iam.aws.jet $(kubectl get policy.iam.aws.jet | grep terrajetloadtest | awk '{print $1}')
# Delete kind cluster
kind destroy cluster --name terrajet-load
Possibly related discussions
Please let me know if I can help figure out what's happening.
Hello team. I want to update this issue with my latest findings.
After a while debugging and understanding a bit better how everything works, I figured the main issue seems to be that terrajet providers, in general, don't work fast enough to be able to consume the controller's workqueue faster than it's produced. Which causes the delays we're seeing as, for any change, the controller needs to go through all the workqueue before it can get to the recent change.
I've setup two side-by-side AWS providers, one using terrajet and another using provider-aws. Then I've setup a prometheus and grafana to collect the metrics exposed by the controller manager.
Open for more details on the experiment setup
Experiment environment
- I ran the experiments in my local machine with a 12 threads CPU (Ryzen 5600X) and 16GB of ram. None were topped during the tests.
- I created a kind cluster to run the experiments on (similar to the main message setup).
- I ran each provider directly on my machine (from the
./_output/bin/linux_amd64/provider
build artifact)- to allow for both providers to run at the same time, I've added the
MetricsBindAddress
option to the controller manager so both could expose metrics. I've used port8080
(default) for the jet provider and8081
for the native one
- to allow for both providers to run at the same time, I've added the
- Unless specified, the providers were executed without additional flags or code changes
- Relevant environment variables for jet provider:
TF_PLUGIN_CACHE_DIR=/tmp/terraform/.terraform
to improve provider download performanceTERRAFORM_VERSION=1.0.5
TERRAFORM_PROVIDER_SOURCE=hashicorp/aws
TERRAFORM_PROVIDER_VERSION=3.56.0
Metrics Setup
The grafana dashboard was created pointing to the prometheus instance
Configuration files
Prometheus.yaml
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
external_labels:
monitor: 'monitor'
rule_files: []
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'jet-provider'
static_configs:
- targets: ['localhost:8080']
- job_name: 'native-provider'
static_configs:
- targets: ['localhost:8081']
Grafana dashboard:
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
},
{
"datasource": {
"type": "datasource",
"uid": "grafana"
},
"enable": true,
"iconColor": "blue",
"name": "Annotations",
"target": {
"limit": 1000,
"refId": "Anno",
"tags": [
"annotation"
],
"type": "tags"
}
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 2,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 10,
"panels": [],
"title": "Controller",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 1
},
"id": 6,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "rate(controller_runtime_reconcile_time_seconds_sum{controller=~\"$controllers\"}[$__rate_interval])",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "B"
},
{
"datasource": {
"name": "Expression",
"type": "__expr__",
"uid": "__expr__"
},
"expression": "$A/$B",
"hide": false,
"refId": "C",
"type": "math"
}
],
"title": "Controller Reconcile Time",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"max": 10,
"min": -1,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 1
},
"id": 12,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "controller_runtime_reconcile_errors_total{controller=~\"$controllers\"}",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Reconcile Errors",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 9
},
"id": 16,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "histogram_quantile(0.95, sum by(le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket{controller=~\"$controllers\"}[$__rate_interval])))",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "B"
},
{
"datasource": {
"name": "Expression",
"type": "__expr__",
"uid": "__expr__"
},
"expression": "$A/$B",
"hide": false,
"refId": "C",
"type": "math"
}
],
"title": "95th percentile reconcile time",
"type": "timeseries"
},
{
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 17
},
"id": 8,
"title": "Queue",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "ms"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 18
},
"id": 4,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "rate(workqueue_queue_duration_seconds_sum{name=~\"$controllers\"}[$__rate_interval])",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Workqueue queue duration",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 18
},
"id": 2,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "workqueue_depth{name=~\"$controllers\"}",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Workqueue depth",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 26
},
"id": 14,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "rate(workqueue_adds_total{name=~\"$controllers\"}[$__rate_interval])",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Rate of Queue Adds",
"type": "timeseries"
}
],
"refresh": false,
"schemaVersion": 37,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": "managed/.*",
"current": {
"selected": false,
"text": [
"managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy",
"managed/policy.iam.aws.crossplane.io"
],
"value": [
"managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy",
"managed/policy.iam.aws.crossplane.io"
]
},
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"definition": "controller_runtime_active_workers",
"description": "Controllers to watch. Regex",
"hide": 0,
"includeAll": true,
"label": "controller",
"multi": true,
"name": "controllers",
"options": [],
"query": {
"query": "controller_runtime_active_workers",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": ".*controller=\"(managed/.*?)\".*",
"skipUrlSync": false,
"sort": 0,
"type": "query"
}
]
},
"time": {
"from": "2022-10-03T04:52:47.659Z",
"to": "2022-10-03T05:09:38.612Z"
},
"timepicker": {},
"timezone": "",
"title": "Crossplane Controllers AWS",
"uid": "-WIVl0VVk",
"version": 2,
"weekStart": ""
}
In the following graphs, the green lines represent the jet provider, while the yellow line represents the native one.
Initially I've created 100 resources managed by each provider (first vertical blue line). There, we can see that the reconciliation time for the terrajet provider is much higher than the native provider. After some time I've changed the jet provider to have a MaxConcurrentReconciles: 10
in the configs (second blue line).
on the first section (after the first vertical blue line), the reconciliation time is about ~14.8s
for the jet provider and ~84ms
for the native provider.
on the second section (after the second vertical blue line), the reconcilicaiton time is about ~997ms
for the jet provider and ~84ms
for the native provider.
on the first section (after the first vertical blue line), the workqueue depth is about 100
items for the jet provider and 0
items for the native provider.
on the second section (after the second vertical blue line), the workqueue depth stabilizes at about 25
items for the jet provider and 0
for the native provider.
This left the question of "What if we bump the number of MaxConcurrentReconciles
further". Bumping it to 50
we see that the reconcile time didn't change, but the number of items in the queue were reduced to 0
on the terrajet provider as well.
This, however, is not maintained when we create even more resources. When jumping to 500
policies on each provider, we see that the reconcile time is, again, not affected. But the workqueue size is significantly affected and cannot go below 500
for the terrajet provider, while the native provider is momentarily affected by the change, but is able to catch up and clear the queue.
To make sure that the reconcile time was the main culprit here, I've also locally changed the Observe method of the native provider to include a time.Sleep(1 * time.Second)
and it had the same queue behavior as the terrajet provider. Which reflected in the same long time to reconcile new resources we saw in terrajet-based providers.
I'm curious to see which solutions we can brew up to this.
For now it seems that terrajet-based providers are not suitable for environments that need to manage many resources of the same type.