CgroupV2 PSI and Perf Daemonset
About
This is a Docker container that can be deployed as a daemonset on any Kubernetes cluster to monitor PSI (Pressure Stall Information) and perf metrics for each container on a node.
Getting Started
To deploy the daemonset, follow these steps.
Prerequisites
Minimum versions:
Docker 20.10
Linux 5.2
Kubernetes 1.17
Every node in the cluster must be running with cgroupv2 enabled on the host machine.
Check CgroupV2 Availability
Ensure that your machine has cgroupv2 available:
$ grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
Having cgroupv2 available does not mean you are using it. Check that the unified cgroup hierarchy is enabled by listing it:
$ ls -alF /sys/fs/cgroup/
total 0
dr-xr-xr-x 5 root root 0 Oct 31 14:52 ./
drwxr-xr-x 10 root root 0 Oct 31 14:52 ../
-r--r--r-- 1 root root 0 Nov 1 08:45 cgroup.controllers
-rw-r--r-- 1 root root 0 Nov 1 08:45 cgroup.max.depth
-rw-r--r-- 1 root root 0 Nov 1 08:45 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Nov 1 08:45 cgroup.procs
-r--r--r-- 1 root root 0 Nov 1 08:45 cgroup.stat
-rw-r--r-- 1 root root 0 Oct 31 14:52 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Nov 1 08:45 cgroup.threads
-rw-r--r-- 1 root root 0 Nov 1 08:45 cpu.pressure
-r--r--r-- 1 root root 0 Nov 1 08:45 cpuset.cpus.effective
-r--r--r-- 1 root root 0 Nov 1 08:45 cpuset.mems.effective
drwxr-xr-x 2 root root 0 Nov 1 08:45 init.scope/
-rw-r--r-- 1 root root 0 Nov 1 08:45 io.cost.model
-rw-r--r-- 1 root root 0 Nov 1 08:45 io.cost.qos
-rw-r--r-- 1 root root 0 Nov 1 08:45 io.pressure
-rw-r--r-- 1 root root 0 Nov 1 08:45 memory.pressure
drwxr-xr-x 106 root root 0 Nov 1 08:45 system.slice/
drwxr-xr-x 3 root root 0 Oct 31 14:52 user.slice/
Note the *.slice directories and the cgroup.* control files at the root; their presence indicates the unified hierarchy is mounted.
If you have cgroupv2 but it isn't enabled as the primary hierarchy, the structure above will instead appear under /sys/fs/cgroup/unified.
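You can also check the filesystem type mounted at the cgroup root; cgroup2fs means the unified hierarchy is in use:
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs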
Enable cgroupv2
Edit /etc/default/grub, add systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX, run sudo update-grub, and reboot the system.
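For example, the edited line might look like this (keep whatever flags your system already sets), followed by the update and reboot:
GRUB_CMDLINE_LINUX="quiet splash systemd.unified_cgroup_hierarchy=1"
$ sudo update-grub
$ sudo reboot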
If cgroupv2 is not available on the system at all, you will have to upgrade the kernel to meet the minimum version in the prerequisites above.
Build Image
There are two Dockerfiles: one for regular deployment and the other for debugging.
If you want to run the server locally, without a container/Kubernetes deployment, edit pid_lookup.go so that it resolves the system's cgroup directory.
Regular image
bash create_image.sh latest
Change the image name in create_image.sh to point at your own registry before running it.
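A typical build-and-push sequence, assuming the script builds and tags the image and your-registry/psi-perf-monitor is a placeholder for the name you set:
$ bash create_image.sh latest
$ docker push your-registry/psi-perf-monitor:latest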
Port
Set the PORT environment variable to specify the metrics port.
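For example, in the container spec of the daemonset (2333 matches the port used in the Prometheus scrape config below):
env:
- name: PORT
  value: "2333"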
Usage
Assuming all the prerequisites have been met and the image has been built and pushed to your Docker registry, follow these steps to deploy the daemonset.
In this section I will refer to the monitoring container as the daemonset and the container being monitored as the host container.
The daemonset mounts the host's /proc and /var/lib/docker directories. It finds each container's pid by searching /var/lib/docker, then reads that pid's entry under /proc to find the container's cgroup information.
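Because the daemonset shares the host PID namespace (see spec.hostPID below), its own /proc already shows host processes, so a minimal sketch of the required hostPath mount covers only /var/lib/docker; the names here are illustrative, and example/ has the project's actual manifest:
volumes:
- name: docker-dir
  hostPath:
    path: /var/lib/docker
containers:
- name: cgroup-monitor
  volumeMounts:
  - name: docker-dir
    mountPath: /var/lib/docker
    readOnly: true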
The Service is used to expose the daemonset's webserver, where the metrics are hosted. If you are not using some kind of service mesh, make sure your Prometheus deployment is in the same namespace as your daemonset deployment.
The perf collector is implemented on top of perf-utils v0.4.0. This project exposes the performance information for each container process on a host. If you want the total performance of a host, please refer to node_exporter.
In order to collect perf events, the daemonset must run with administrator privileges and share the host PID namespace with the container. Both are configured in the pod YAML file, via spec.hostPID and spec.containers[].securityContext. The example.yaml shows how to configure them.
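A minimal sketch of those two settings (abbreviated; example.yaml is the authoritative version):
spec:
  hostPID: true
  containers:
  - name: cgroup-monitor
    securityContext:
      privileged: true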
Also note that the kernel's perf event interface must be accessible. You can open it up by adding the following line to /etc/sysctl.conf:
kernel.perf_event_paranoid = -1
Then apply it:
$ sysctl -p /etc/sysctl.conf
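To apply the same setting immediately, without editing the file, you can also run:
$ sudo sysctl -w kernel.perf_event_paranoid=-1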
Because the perf collector is expensive, there is a switch to enable or disable it, and you can also choose which perf metrics are collected. All of the supported perf metrics come from the perf-utils project, as listed below.
Available HW_PERF_LABELS
CPUCycles,Instructions,CacheRefs,CacheMisses,BranchInstr,BranchMisses,BusCycles,StalledCyclesFrontend,StalledCyclesBackend,RefCPUCycles,TimeEnabled,TimeRunning
Available SW_PERF_LABELS
CPUClock,TaskClock,PageFaults,ContextSwitches,CPUMigrations,MinorPageFaults,MajorPageFaults,AlignmentFaults,EmulationFaults,TimeEnabled,TimeRunning
Available CACHE_PERF_LABELS
L1DataReadHit,L1DataReadMiss,L1DataWriteHit,L1InstrReadMiss,LastLevelReadHit,LastLevelReadMiss,LastLevelWriteHit,LastLevelWriteMiss,DataTLBReadHit,DataTLBReadMiss,DataTLBWriteHit,DataTLBWriteMiss,InstrTLBReadHit,InstrTLBReadMiss,BPUReadHit,BPUReadMiss,NodeReadHit,NodeReadMiss,NodeWriteHit,NodeWriteMiss,TimeEnabled,TimeRunning
The corresponding fields in the YAML file look like this:
- name: PERF_COLLECTOR_ENABLED
  value: "true"
- name: HW_PERF_LABELS
  value: "Instructions"
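To collect several metrics at once, the value presumably takes a comma-separated list of the label names above (this format is an assumption based on how the lists are published):
- name: HW_PERF_LABELS
  value: "CPUCycles,Instructions,CacheMisses"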
Then point Prometheus at the /metrics endpoint of your pods on the metrics port:
- job_name: 'psi-perf'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: http
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.+):\d+
    target_label: __address__
    replacement: $1:2333
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.+):\d+
    target_label: instance
    replacement: $1
    action: replace
Example Kubernetes YAML
See the example/ directory.
API
There are a few endpoints:
/          Homepage
/health    K8s health endpoint
/psi       Debugging PSI
/metrics   Prometheus metrics + PSI endpoint
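A quick sanity check from inside the cluster, where <pod-ip> is a placeholder and 2333 matches the scrape config above:
$ curl http://<pod-ip>:2333/metrics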
Data Available
The following PSI and perf metrics are reported to Prometheus and are available for querying.
# HELP psi_perf_monitor_cpu_cycles CPU cycles of monitored container
# TYPE psi_perf_monitor_cpu_cycles gauge
psi_perf_monitor_cpu_cycles{container_name="cgroup-monitor-sc",namespace="monitor",pid="5275",pod_name="cgroup-monitor-sc-nw78c"} 2.6810185e+07
psi_perf_monitor_cpu_cycles{container_name="etcd",namespace="kube-system",pid="31708",pod_name="etcd-crack-bedbug"} 5.949181e+06
psi_perf_monitor_cpu_cycles{container_name="kube-apiserver",namespace="kube-system",pid="31318",pod_name="kube-apiserver-crack-bedbug"} 0
psi_perf_monitor_cpu_cycles{container_name="kube-controller-manager",namespace="kube-system",pid="31809",pod_name="kube-controller-manager-crack-bedbug"} 1.887559e+07
psi_perf_monitor_cpu_cycles{container_name="kube-flannel",namespace="kube-system",pid="32265",pod_name="kube-flannel-ds-amd64-cszlv"} 3.4171708e+07
psi_perf_monitor_cpu_cycles{container_name="kube-scheduler",namespace="kube-system",pid="32102",pod_name="kube-scheduler-crack-bedbug"} 0
psi_perf_monitor_cpu_cycles{container_name="node-exporter",namespace="monitor",pid="22435",pod_name="prometheus-prometheus-node-exporter-jgtv7"} 2.89654836e+08
# HELP psi_perf_monitor_instruction instruction of monitored container
# TYPE psi_perf_monitor_instruction gauge
psi_perf_monitor_instruction{container_name="cgroup-monitor-sc",namespace="monitor",pid="5275",pod_name="cgroup-monitor-sc-nw78c"} 5.0756236e+07
psi_perf_monitor_instruction{container_name="etcd",namespace="kube-system",pid="31708",pod_name="etcd-crack-bedbug"} 1.2358213e+07
psi_perf_monitor_instruction{container_name="kube-apiserver",namespace="kube-system",pid="31318",pod_name="kube-apiserver-crack-bedbug"} 0
psi_perf_monitor_instruction{container_name="kube-controller-manager",namespace="kube-system",pid="31809",pod_name="kube-controller-manager-crack-bedbug"} 1.5420931e+07
psi_perf_monitor_instruction{container_name="kube-flannel",namespace="kube-system",pid="32265",pod_name="kube-flannel-ds-amd64-cszlv"} 5.9731916e+07
psi_perf_monitor_instruction{container_name="kube-scheduler",namespace="kube-system",pid="32102",pod_name="kube-scheduler-crack-bedbug"} 0
psi_perf_monitor_instruction{container_name="node-exporter",namespace="monitor",pid="22435",pod_name="prometheus-prometheus-node-exporter-jgtv7"} 1.89660562e+08
# HELP psi_perf_monitor_sc_monitored_cpu_psi CPU PSI of monitored container
# TYPE psi_perf_monitor_sc_monitored_cpu_psi gauge
psi_perf_monitor_sc_monitored_cpu_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="10s"} 0
psi_perf_monitor_sc_monitored_cpu_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="300s"} 0
psi_perf_monitor_sc_monitored_cpu_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="60s"} 0
psi_perf_monitor_sc_monitored_cpu_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="total"} 328157795
# HELP psi_perf_monitor_sc_monitored_io_psi IO PSI of monitored container
# TYPE psi_perf_monitor_sc_monitored_io_psi gauge
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="10s"} 0
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="300s"} 0
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="60s"} 0
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="total"} 69165
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="10s"} 0
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="300s"} 0
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="60s"} 0
psi_perf_monitor_sc_monitored_io_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="total"} 69210
# HELP psi_perf_monitor_sc_monitored_mem_psi Mem PSI of monitored container
# TYPE psi_perf_monitor_sc_monitored_mem_psi gauge
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="10s"} 0
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="300s"} 0
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="60s"} 0
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="full",window="total"} 0
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="10s"} 0
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="300s"} 0
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="60s"} 0
psi_perf_monitor_sc_monitored_mem_psi{container_name="carts",instance="172.169.8.219",job="cgroup-monitor",pod_name="carts-677b598f6f-lb9zn",type="some",window="total"} 0
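For example, to chart the rolling 60-second "some" CPU pressure of each monitored container, a simple PromQL selector over the labels shown above works:
psi_perf_monitor_sc_monitored_cpu_psi{type="some",window="60s"}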