collector logging running amok
Elyytscha opened this issue · 9 comments
Our aggregated logging was exploding due to StackRox collector logging. There were millions of lines like this over the last few days, coming from 3 of 10 collector nodes.
To mitigate as fast as possible, we actually had to delete StackRox.
This is the error message that was present millions of times; it appeared on 3 of 10 collector nodes about 6-10 times every millisecond:
[E 20221209 165715 ConnScraper.cpp:415] Could not determine network namespace: No such file or directory
It was deployed like this:
/bin/bash <(curl -fsSL https://raw.githubusercontent.com/stackrox/stackrox/master/scripts/quick-helm-install.sh)
on GKE v1.24.
Thanks for sharing this @Elyytscha, I'm sorry that this caused you to have to delete StackRox. To help in our investigation, can you share any relevant details about the deployments running on the 3/10 affected nodes? Also, are you using the GKE stable release channel?
Yeah, it was just an evaluation setup, so we didn't run this in production, and it's not that critical that we deleted it. But when this happens during an evaluation, you will agree it is a bad situation.
Yes, we are on the GKE stable release channel.
For relevant details about the deployments, I have to investigate a little bit.
@Elyytscha, thanks for this confirmation of the release channel. We will make sure to fix the logging flood soon.
Are you aware of a common factor among the 3 nodes having the issue? Is something different about them relative to the other 7?
We will try to reproduce the issue, but in the meantime, it would greatly help if you could provide us with the result of ls -lR /proc/xxxx
on the host of a failing node, replacing xxxx with any available PID.
@Elyytscha, based on the early analysis of this issue, it seems that you should be able to work around it by disabling the scraper component of collector. This can be achieved by modifying the daemonset definition and setting "turnOffScrape" to true in the COLLECTOR_CONFIG env var.
https://github.com/stackrox/collector/blob/master/docs/references.md#collector-config
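For example, something like the following should show what COLLECTOR_CONFIG currently contains before you change it, and then let you edit the daemonset in place (namespace and object names below are the StackRox defaults; adjust if yours differ):

# Show the current COLLECTOR_CONFIG value on the collector daemonset
kubectl -n stackrox get ds collector -o yaml | grep -A 1 COLLECTOR_CONFIG

# Edit the daemonset and merge "turnOffScrape": true into the existing JSON value
kubectl -n stackrox edit ds/collector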
Hello,
first I wanted to say thanks for your help! Getting answers this fast is not a given in open-source projects.
Are you aware of a common factor among the 3 nodes having the issue? Is something different about them relative to the other 7?
No, those nodes are from the same resource pool; the only difference to the other nodes could be the pods which get scheduled to them.
Maybe relevant: we use Calico as the SDN.
This is the info I can give about the pods running on one of the affected nodes (but it's possible that the pods which caused this have already been scheduled to another node).
We will try to reproduce the issue, but in the meantime, it would greatly help if you could provide us with the result of
ls -lR /proc/xxxx
on the host of a failing node, replacing xxxx with any available PID.
Be prepared, the logfile I got for this is long. It is so big that I had to split it into two files, but I think you will find what you want to see there, because there are also 'No such file or directory' messages.
getpids.log
getpids.2.log
@Elyytscha, based on the early analysis of this issue, it seems that you should be able to work around it by disabling the scraper component of collector. This can be achieved by modifying the daemonset definition and setting "turnOffScrape" to true in the COLLECTOR_CONFIG env var.
Here I have some questions:
- Which StackRox functionality will be disabled/unavailable if we turn this off?
- Is this possible via the Helm chart? Would https://github.com/stackrox/helm-charts/blob/5cd826a14d7c30d1b7ca538b4ff71d1723339a2c/3.72.2/secured-cluster-services/values-public.yaml.example#L324 be the right place for setting this env var?
We debugged a little further down the rabbit hole. What we found out is that this happens on nodes where we have static Jenkins agents running which build Docker containers in containerd.
The related PIDs, for which the error appears within StackRox, look like this:
# ps aux | grep defunct
root 38481 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38482 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38804 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38866 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38958 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 39059 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 39193 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 91761 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 91807 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 91952 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92009 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92243 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92332 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92577 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
and we think it's related to this problem in containerd:
containerd/containerd#5708
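That would explain the error: a defunct process has already released its namespaces, so the /proc/<pid>/ns/net link can no longer be resolved and the lookup fails with "No such file or directory", which is exactly what the scraper logs. A quick, read-only way to confirm this on the host of an affected node (the loop below is just an illustration, run it as root):

# List all defunct PIDs and try to resolve their network namespace link;
# for zombies the readlink fails, matching the ConnScraper error
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^Z/ {print $1}'); do
  printf '%s: ' "$pid"
  readlink "/proc/$pid/ns/net" || echo 'network namespace gone (defunct)'
done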
Good to hear that you may have identified the root cause. Setting turnOffScrape to true will solve the problem with the logging statement, but finding the root cause and solving that is the best option. turnOffScrape has not been extensively tested and is used for internal debugging. When you set it to true you will lose information about endpoints and connections formed before collector was turned on. You can set it via helm charts, but that is not the method that I would recommend. You have to be careful when doing it that way, as COLLECTOR_CONFIG is used to set a few different parameters and you want to set those correctly, not just turnOffScrape. A command you could run to set turnOffScrape is
kubectl set env ds/collector COLLECTOR_CONFIG='{"turnOffScrape":true,"tlsConfig":{"caCertPath":"/var/run/secrets/stackrox.io/certs/ca.pem","clientCertPath":"/var/run/secrets/stackrox.io/certs/cert.pem","clientKeyPath":"/var/run/secrets/stackrox.io/certs/key.pem"}}' --namespace stackrox
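Once that has rolled out, something like this should confirm that the flood has stopped (assuming the container inside the daemonset is named "collector"; adjust if yours differs):

# Wait for the daemonset to pick up the new env var, then count recent occurrences
# of the scraper error in one of the collector pods (should be 0)
kubectl -n stackrox rollout status ds/collector
kubectl -n stackrox logs ds/collector -c collector --since=10m | grep -c 'Could not determine network namespace'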
To use helm charts you have to set "tlsConfig" correctly. See https://github.com/stackrox/helm-charts/blob/5cd826a14d7c30d1b7ca538b4ff71d1723339a2c/3.72.2/secured-cluster-services/templates/collector.yaml#L[…]2
Before your latest comment, I thought the problem might be that your resource limits and requests in namespaces other than stackrox were too low. It might still be worth looking into that.
We fixed it. The issue was due to our old Docker-in-Docker container build system. For new systems we actually use Kaniko, but for old legacy systems there are still Docker builds via dockerd running inside a containerd system. We fixed it basically thanks to this comment:
docker-library/docker#318 (comment)
Docker has added tini as docker-init in their container image, and we used it with:
ENTRYPOINT ["docker-init", "--", "dockerd-entrypoint.sh"]
After this, we no longer had zombie/defunct processes left over after Docker-in-Docker builds had run.
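For anyone else hitting this, a quick sanity check we could suggest from inside the DinD build container after the change (it only reads /proc, so it also works on busybox-based images):

# PID 1 should now be docker-init (tini), which reaps exited children
cat /proc/1/comm

# Count zombie processes left over after a build; with an init reaping them,
# this should stay at zero
grep -l '^State:[[:space:]]*Z' /proc/[0-9]*/status 2>/dev/null | wc -l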
Actually, someone could argue that StackRox showed us an issue in our k8s cluster, but the way it showed us the issue basically produced another issue (flooding our log system with an unnecessary amount of logs).
Still, I think it would be a good idea to limit the log messages that StackRox, or rather collector, produces in such a situation.
The situation appears when there are zombie/defunct processes from old containers which somehow don't get reaped.
Glad to hear that you resolved your problem. Thanks for bringing this to our attention. Based on this experience we plan a few improvements to collector including throttling of logging statements and better handling of defunct processes.