Performance: support-bundle takes > 3 mins
xavpaice opened this issue · 8 comments
Bug Description
When running tests in KOTS, the support-bundle CLI test takes > 3 mins to run, causing timeouts. While increasing the timeout is a workaround, 3 mins is too long to expect Troubleshoot to take.
Expected Behavior
support-bundle returns in < 3 mins
Steps To Reproduce
kURL cluster: https://kurl.sh/8cb4d5d
Command: kubectl support-bundle --load-cluster-specs --redactors=configmap/default/kotsadm-redact-spec/redact-spec,configmap/default/kotsadm-qakotsregression-redact-spec/redact-spec --interactive=false --allow-insecure-connections
curl -sSL https://kurl.sh/8cb4d5d | sudo bash
Additional Context
Include the following information.
- Troubleshoot version. If you built from source, note that, including the version of Go you used to build with.
- Operating system
- Operating system version
- Other details that might be helpful in diagnosing the problem
TODO find the troubleshoot version, content of the extra redactors, and the Troubleshoot spec stored in cluster.
Using v0.69.3 (v0.70.0 is current), and the following specs:
apiVersion: troubleshoot.sh/v1beta2
kind: Collector
metadata:
  name: sentry-enterprise
spec:
  collectors:
    - secret:
        name: myapp-postgres
        key: uri
        includeValue: false
    - logs:
        selector:
          - app=sentry
          - role=worker
        limits:
          maxAge: 720h # 30*24
          maxLines: 10000
    - run:
        collectorName: ping-google
        image: flungo/netutils
        command: ["ping"]
        args: ["www.google.com"]
        timeout: 5s
    - http:
        collectorName: echo-ip
        get:
          url: https://api.replicated.com/market/v1/echo/ip
apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: kotsadm-redact
spec:
  redactors:
    - name: redact cluster version info
      fileSelector:
        file: cluster-info/cluster_version.json
      removals:
        regex:
          - redactor: ("major"?:) (")(?P<mask>.*)(")
          - redactor: ("minor"?:) (")(?P<mask>.*)(")
Result:
============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run/ping-google (F) : 36,044ms
cluster-resources (S) : 1,648ms
http/echo-ip (S) : 279ms
configmap/coredns (S) : 6ms
logs/app=sentry,role=worker (S) : 5ms
cluster-info (S) : 1ms
secret (S) : 0ms
============ Redactors summary =============
In-cluster collectors : 6,902ms
============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed
Duration: 44,913ms
real 0m45.313s
user 0m30.832s
sys 0m2.266s
Of that time, the longest gap:
I0718 23:43:25.639488 42228 run_pod.go:450] Pod ping-google has been scheduled for deletion
I0718 23:43:25.639516 42228 run_pod.go:454] Continuously poll each second for Pod ping-google deletion for maximum 60 seconds
I0718 23:43:56.645939 42228 redact.go:54] Redacting cluster-resources/priorityclasses.json
I0718 23:43:56.645952 42228 run.go:155] Collecting support bundle: failed to run collector: run/ping-google: timeout
Attached: logs from a cluster where this is displayed.
I can't see any clear gaps in the logs where one particular thing takes the bulk of the time; rather, there are a lot of items in the spec, some of which individually take longer than others to complete, and everything runs sequentially to avoid overloading the k8s API with requests.
This seems interesting. Look at these two pod times.
I0719 13:53:10.766915 50166 run.go:155] Collecting support bundle: run/ping-google
I0719 13:53:10.776941 50166 run_pod.go:155] Pod ping-google has been created
I0719 13:53:15.777192 50166 result.go:25] Creating symlink /ping-google.log -> cluster-resources/pods/logs/default/ping-google/collector.log
I0719 13:53:15.777402 50166 result.go:75] Added "/ping-google.log" symlink of "cluster-resources/pods/logs/default/ping-google/collector.log" in bundle output
I0719 13:53:15.789797 50166 result.go:130] Added ".json" to bundle output
I0719 13:53:15.789912 50166 result.go:130] Added "-events.json" to bundle output
I0719 13:53:15.799139 50166 run_pod.go:450] Pod ping-google has been scheduled for deletion
I0719 13:53:15.799175 50166 run_pod.go:454] Continuously poll each second for Pod ping-google deletion for maximum 60 seconds
I0719 13:53:46.808187 50166 run.go:155] Collecting support bundle: failed to run collector: run/ping-google: timeout
I'm guessing that's about 5 seconds to create the pod, going by the delay from pod creation to log collection.
However, scheduling the pod for deletion times out and errors after about 30 seconds. That's the vast majority of the runtime for this pod.
The very next pod doesn't see this deletion time, or really almost any deletion time at all.
I0719 13:53:46.808219 50166 run.go:155] Collecting support bundle: run/templated-registry-replicated-com
I0719 13:53:46.836244 50166 run_pod.go:155] Pod templated-registry-replicated-com has been created
I0719 13:53:55.976823 50166 result.go:25] Creating symlink templated-registry-replicated-com/templated-registry-replicated-com.log -> cluster-resources/pods/logs/default/templated-registry-replicated-com/collector.log
I0719 13:53:55.977088 50166 result.go:75] Added "templated-registry-replicated-com/templated-registry-replicated-com.log" symlink of "cluster-resources/pods/logs/default/templated-registry-replicated-com/collector.log" in bundle output
I0719 13:53:55.989749 50166 result.go:130] Added "templated-registry-replicated-com/templated-registry-replicated-com.json" to bundle output
I0719 13:53:55.989855 50166 result.go:130] Added "templated-registry-replicated-com/templated-registry-replicated-com-events.json" to bundle output
I0719 13:53:56.007181 50166 run_pod.go:450] Pod templated-registry-replicated-com has been scheduled for deletion
I0719 13:53:56.007216 50166 run_pod.go:454] Continuously poll each second for Pod templated-registry-replicated-com deletion for maximum 60 seconds
I0719 13:53:56.010445 50166 run.go:155] Collecting support bundle: run/templated-quay-io
I0719 13:53:56.028769 50166 run_pod.go:155] Pod templated-quay-io has been created
I0719 13:54:00.093888 50166 result.go:25] Creating symlink templated-quay-io/templated-quay-io.log -> cluster-resources/pods/logs/default/templated-quay-io/collector.log
I0719 13:54:00.094105 50166 result.go:75] Added "templated-quay-io/templated-quay-io.log" symlink of "cluster-resources/pods/logs/default/templated-quay-io/collector.log" in bundle output
I0719 13:54:00.108496 50166 result.go:130] Added "templated-quay-io/templated-quay-io.json" to bundle output
I0719 13:54:00.108637 50166 result.go:130] Added "templated-quay-io/templated-quay-io-events.json" to bundle output
I0719 13:54:00.139593 50166 run_pod.go:450] Pod templated-quay-io has been scheduled for deletion
I0719 13:54:00.139630 50166 run_pod.go:454] Continuously poll each second for Pod templated-quay-io deletion for maximum 60 seconds
I0719 13:54:00.144790 50166 run.go:155] Collecting support bundle: run/not-templated-quay-io
That looks suspiciously like the 30-second grace period that the pod is given to shut down. Considering we can't control how the pod responds to a TERM and we've already collected our logs, perhaps we should set the grace period to 0 and just move on?
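For illustration only, a minimal client-go sketch (assuming a kubernetes.Interface; not Troubleshoot's actual cleanup code) of "delete with a zero grace period and move on":

package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodImmediately issues the delete with a zero-second grace period so
// the kubelet kills the container right away instead of waiting out the
// default 30-second termination grace period.
func deletePodImmediately(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	grace := int64(0)
	return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}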
Edit:
Here it is again: we're losing minutes to what I suspect is a deletion timeout.
I0719 13:51:04.150622 50166 run.go:155] Collecting support bundle: copy-from-host/kurl-host-preflights
I0719 13:51:06.255574 50166 result.go:130] Added "host-preflights/results-1689773363.txt" to bundle output
I0719 13:51:06.258035 50166 copy_from_host.go:375] Daemonset troubleshoot-copyfromhost-z62vx has been scheduled for deletion
I0719 13:51:06.264389 50166 copy_from_host.go:387] Continuously poll each second for Pod deletion of DaemontSet troubleshoot-copyfromhost-z62vx for maximum 60 seconds
I0719 13:51:38.276305 50166 run.go:155] Collecting support bundle: data
Wow, even the collectd run is basically all waits and timeouts.
I0719 13:51:38.287858 50166 run.go:155] Collecting support bundle: collectd/collectd
I0719 13:52:08.298974 50166 copy_from_host.go:233] Timed out waiting for daemonset troubleshoot-copyfromhost-2sxw8 to be ready
I0719 13:52:08.328778 50166 result.go:130] Added "collectd/rrd/craig-qakotsregression/error.txt" to bundle output
I0719 13:52:08.328841 50166 copy_from_host.go:375] Daemonset troubleshoot-copyfromhost-2sxw8 has been scheduled for deletion
I0719 13:52:08.336949 50166 copy_from_host.go:387] Continuously poll each second for Pod deletion of DaemontSet troubleshoot-copyfromhost-2sxw8 for maximum 60 seconds
I0719 13:53:08.337791 50166 copy_from_host.go:415] Pod troubleshoot-copyfromhost-2sxw8-tdwws forcefully deleted after reaching the maximum wait time of 60 seconds
I0719 13:53:08.353265 50166 copy_from_host.go:423] Daemonset pod troubleshoot-copyfromhost-2sxw8-tdwws in default namespace has been deleted
I0719 13:53:08.353825 50166 run.go:155] Collecting support bundle: copy/spec-qakotsregression
Any chance we can see the error.txt on that thing?
- 30 seconds to decide it's not coming up
- 1 minute waiting for it to shut down and then force deleting it
It spent 90 seconds of its 90.065s run time waiting for it to start and waiting for it to shut down.
If we:
- Set grace period on deletion to 0
- Force delete with a grace period of 0 if the pod errors
We would remove about 120 seconds of run time (30 seconds each for the two pods that didn't shut down gracefully, and 60 seconds waiting to delete a pod that never became healthy and was force deleted anyway).
I'm just assuming, from these timeouts, that we're not force deleting or setting a grace period. If we are already waiting 60 seconds before force deleting, I propose we start with the force delete whenever the pod errored in the first place. 60 seconds is a pretty long timeout too; a rough sketch of this idea follows.
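A rough Go sketch of that proposed policy (hypothetical helper, assuming client-go; not what Troubleshoot currently does): delete with a zero grace period, skip the wait entirely if the pod errored, and otherwise poll for only a few seconds instead of 60.

package example

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// cleanupPod deletes a collector pod with a zero grace period. If the pod
// errored (never became healthy), it returns without waiting; otherwise it
// polls briefly for the pod to disappear rather than waiting 60 seconds.
func cleanupPod(ctx context.Context, client kubernetes.Interface, ns, name string, errored bool) error {
	grace := int64(0)
	if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	}); err != nil {
		return err
	}
	if errored {
		return nil // nothing useful to wait for
	}
	return wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Second, true,
		func(ctx context.Context) (bool, error) {
			_, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return true, nil // pod is gone
			}
			return false, err
		})
}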
Work was done in PR #1196 to utilise zero-second grace periods to force delete unhealthy pods. For healthy pods launched by the run_pod collector, termination needs to take timeouts into consideration, which has not been implemented yet. I describe the problem with possible solutions here.
As for the ping-google collector, adding a timeout to the ping command should do the trick:
apiVersion: troubleshoot.sh/v1beta2
kind: Collector
metadata:
  name: sentry-enterprise
spec:
  collectors:
    - run:
        collectorName: ping-google
        image: flungo/netutils
        command: ["ping"]
        args: ["google.com", "-c", "3", "-w", "5"]
        timeout: 5s
============ Collectors summary =============
Succeeded (S), eXcluded (X), Failed (F)
=============================================
run/ping-google (S) : 5,578ms
cluster-resources (S) : 2,396ms
cluster-info (S) : 7ms
============ Redactors summary =============
In-cluster collectors : 518ms
============= Analyzers summary =============
Succeeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed
Duration: 8,528ms