Errors in agent's netns topology probe on k8s
waterjiao opened this issue · 4 comments
Hello
I used the master version, and I'm running skydive on k8s v0.19.0.
Env:
host: CentOS7
container: ubuntu20.04
My config (skydive.yaml, the skydive agent ConfigMap):
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: skydive-agent
  name: skydive-agent-config
data:
  SKYDIVE_AGENT_TOPOLOGY_PROBES: runc docker
  SKYDIVE_AGENT_LISTEN: 127.0.0.1:8081
  SKYDIVE_AGENT_TOPOLOGY_NETNS_RUN_PATH: /host/run
When I add a network namespace on the host (CentOS 7):
# ip netns add net1
Here's the skydive agent log:
2021-03-06T06:56:44.413Z DEBUG netns/netns.go:133 (*ProbeHandler).Register host2: Register network namespace: /host/run/net1
2021-03-06T06:56:50.125Z ERROR netns/netns.go:307 (*ProbeHandler).start host2: Failed to register namespace: /host/run/net1. All attempts fail:
#1: /host/run/net1 does not seem to be a valid namespace
#2: /host/run/net1 does not seem to be a valid namespace
#3: /host/run/net1 does not seem to be a valid namespace
#4: /host/run/net1 does not seem to be a valid namespace
...
Note the "/host/run/net1 does not seem to be a valid namespace" errors, which mean that the device number of /host/run/net1 is the same as the device number of /host/run.
Code is:
if parent := filepath.Dir(path); parent != "" {
	if err := syscall.Stat(parent, &parentStats); err == nil {
		if stats.Dev == parentStats.Dev {
			return fmt.Errorf("%s does not seem to be a valid namespace", path)
		}
	}
}
I used the stat command to check this.
On the host:
# stat --format=%d /var/run/netns
22
# stat --format=%d /var/run/netns/net1
3
But in the agent pod (container):
# stat --format=%d /host/run
22
# stat --format=%d /host/run/net1
22
Note that net1's device number differs between the host and the pod.
It's tricky to debug. Has anyone encountered such a problem before?
Thanks
Hello. We did encounter such bugs some time ago but it was supposed to be fixed :-)
The reason for the check is that "ip netns" first creates a regular file for the new namespace, then quickly bind-mounts the namespace file from /proc onto that regular file. Until the bind mount is in place, the file sits on the same device as its parent directory.
I'll try to reproduce the problem - pretty tricky to debug indeed - and I'll keep you updated
Did you use the Kubernetes template in contrib/kubernetes? It specifies hostPID: true.
Sorry for taking so long to answer.
Yes, I used the Kubernetes template in contrib/kubernetes.
hostPID: true
hostNetwork: true
I also tried adding more pod security settings.
This is my config:
hostPID: true
hostNetwork: true
hostIPC: true
securityContext:
  privileged: true
  runAsUser: 0
  allowPrivilegeEscalation: true
It didn't work.
I also tried on a CentOS host with a plain docker container and got the same issue.
env:
host: centos7
container: centos7
I ran the docker container with:
docker run -it --privileged -v /var/run/netns:/host/run docker.io/centos /bin/bash
Then I added a network namespace on the host (CentOS 7):
# ip netns add net1
I used the stat command to check this.
On the host:
# stat --format=%d /var/run/netns
22
# stat --format=%d /var/run/netns/net1
3
But in the container:
# stat --format=%d /host/run
22
# stat --format=%d /host/run/net1
22
Note that net1's device number differs between the host and the container.
@waterjiao Hello. Sorry for the long delay.
On my CentOS 7 VM, I get the same results in the container as on the host. What storage driver are you using? Is it overlayfs?