netdata/helmchart

parent vs child? I don't get it

dongho-jung opened this issue · 15 comments

According to the documentation here, I thought 'The nodes that send metrics are called child nodes, and the nodes that receive metrics are called parent nodes.'

However, the configs for them are as follows:

parent:

configs:
  netdata:
    enabled: true
    path: /etc/netdata/netdata.conf
    data: |
      [global]
        memory mode = save
      [plugins]
        cgroups = no
        tc = no
        enable running new plugins = no
        check for new plugins every = 72000
        python.d = no
        charts.d = no
        go.d = no
        node.d = no
        apps = no
        proc = no
        idlejitter = no
        diskspace = no

child:

configs:
  netdata:
    enabled: true
    path: /etc/netdata/netdata.conf
    data: |
      [global]
        memory mode = none
      [health]
        enabled = no

If I understand correctly, why is the config for the plugins (collectors) written in the parent? Shouldn't it be written in the child? If it is in the parent's config, it can only affect the parent's collectors, and the parent is just a Deployment that runs on one specific node (as opposed to the DaemonSet for the child).

If my understanding is wrong, how does it work with a Deployment?

thanks!

Hi, @0xF4D3C0D3 👋

Netdata child is a Daemonset.

A DaemonSet ensures that all (or some) Nodes run a copy of a Pod.

That means there is a node with both parent and child instances running. That is why we want all data collection to be disabled on the parent instance.

why is the config for the plugins (collectors) written in the parent?

We disable all the data collection plugins.

         [plugins] 
           cgroups = no 
           tc = no 
           enable running new plugins = no 
           check for new plugins every = 72000 
           python.d = no 
           charts.d = no 
           ...

Hi @ilyam8! Thanks for the reply!

Your reply matches what I thought it should be, but when I do it like below, it doesn't work :(

What I wanted to do was, as you said, disable all the collectors on the parent

         [plugins]  # parent's config
           cgroups = no 
           tc = no 
           enable running new plugins = no 
           check for new plugins every = 72000 
           python.d = no 
           charts.d = no 
           ...

and then write the child's config like this:

 configs:  # child's config
    netdata:
      enabled: true
      path: /etc/netdata/netdata.conf
      data: |
        [global]
          memory mode = none
        [plugins]
          PATH environment variable = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin
          PYTHONPATH environment variable =
          proc = yes
          diskspace = no
          timex = no
          cgroups = no
          tc = no
          idlejitter = no
          enable running new plugins = no
          check for new plugins every = 60
          slabinfo = no
          node.d = no
          perf = no
          fping = no
          go.d = no
          ioping = no
          python.d = no
          apps = no
          charts.d = no
          freeipmi = no
        [plugin:proc]
          netdata server resources = yes
          /proc/pagetypeinfo = no
          /proc/stat = yes
          /proc/uptime = no
          /proc/loadavg = no
          ...

so that only things like CPU, overview, etc. show up.

However, if I deploy like this, the dashboard is as follows:
(screenshot)

I couldn't configure the collectors for the children; instead, I could only do so in the parent's config.
Am I doing it wrong? When I `docker exec` in to check their configs, the contents are as follows:

bash-5.0# #I'm child
bash-5.0# pwd
/etc/netdata
bash-5.0# cat netdata.conf
[global]
  memory mode = none
[health]
  enabled = no
[plugins]
  PATH environment variable = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin
  PYTHONPATH environment variable =
  proc = yes
  diskspace = no
  timex = no
  cgroups = no
  tc = no
  idlejitter = no
  enable running new plugins = no
  check for new plugins every = 60
  slabinfo = no
  node.d = no
  perf = no
  fping = no
  go.d = no
  ioping = no
  python.d = yes
  apps = no
  charts.d = no
  freeipmi = no
[plugin:proc]
  netdata server resources = yes
  /proc/pagetypeinfo = no
  /proc/stat = yes
  /proc/uptime = no
  /proc/loadavg = no
  /proc/sys/kernel/random/entropy_avail = no
  /proc/pressure = no
  /proc/interrupts = no
  /proc/softirqs = no
  /proc/vmstat = no
  /proc/meminfo = no
  /sys/kernel/mm/ksm = no
  /sys/block/zram = no
  /sys/devices/system/edac/mc = no
  /sys/devices/system/node = no
  /proc/net/dev = no
  /proc/net/wireless = no
  /proc/net/sockstat = no
  /proc/net/sockstat6 = no
  /proc/net/netstat = no
  /proc/net/snmp = no
  /proc/net/snmp6 = no
  /proc/net/sctp/snmp = no
  /proc/net/softnet_stat = no
  /proc/net/ip_vs/stats = no
  /sys/class/infiniband = no
  /proc/net/stat/conntrack = no
  /proc/net/stat/synproxy = no
  /proc/diskstats = no
  /proc/mdstat = no
  /proc/net/rpc/nfsd = no
  /proc/net/rpc/nfs = no
  /proc/spl/kstat/zfs/arcstats = no
  /proc/spl/kstat/zfs/pool/state = no
  /sys/fs/btrfs = no
  ipc = no
  /sys/class/power_supply = no
bash-5.0#
bash-5.0# #I'm parent
bash-5.0# pwd
/etc/netdata
bash-5.0# cat netdata.conf
[global]
  memory mode = save
[plugins]
  cgroups = no
  tc = no
  enable running new plugins = no
  check for new plugins every = 72000
  python.d = no
  charts.d = no
  go.d = no
  node.d = no
  apps = no
  proc = no
  idlejitter = no
  diskspace = no
bash-5.0#

If I reset the child's config to the default values and write the parent's config as the child's config, the dashboard looks the way I want.

have a good day! :D

@0xF4D3C0D3 you want:

  • disable all collectors on the parent instance
  • enable only the proc collector on the children instances

Is that the case?

@ilyam8 yeah exactly.

IIUC, the responsibility for collecting metrics is on the child, and for showing metrics on the parent..? Thus, I think the parent should disable all its collectors so that it does only its job, showing metrics.

Sorry for the verbosity due to my poor English. The point is: yes, this is the case.

IIUC, the responsibility for collecting metrics is on the child, and for showing metrics on the parent..? Thus, I think the parent should disable all its collectors so that it does only its job, showing metrics.

That is the default configuration. You need no additional changes (assuming you want all data collectors enabled on the children instances, not only proc.plugin; otherwise set `= no` for all the other collectors).

Your config looks fine btw.

showing metrics on the parent

Keep in mind that in order to see a child instance's metrics you need to switch to that instance using the "Replicated Nodes" menu.

(Screenshot of the "Replicated Nodes" menu, 2021-08-02)

@ilyam8 umm.. maybe the problem is caused by my replicated nodes, I guess.

Your config looks fine btw.

Because even though it looks like I configured it correctly, as you saw, there are no metrics on the dashboard despite the proc plugin being enabled in the child config.

(screenshot)

And the above one is the child's, as you can see from the subpath in the URL.

However, my replicated nodes look like this:
(screenshot)

Shouldn't the parent node and the child node be different?

this is my context:

╭─dongho@host-1 in ~/***/netdata on main ✘ (origin/main)
╰$ ./scripts/log.sh proc



parent:
2021-08-02 08:29:07: netdata INFO  : MAIN : NETDATA_SYSTEM_VIRT_DETECTION=/proc/cpuinfo
2021-08-02 08:29:31: 6: 172 '[172.31.32.52]:52116' 'STREAM' (sent/all = 0/0 bytes -0%, prep/sent/total = 1627892971881.52/1627892971881.68/0.16 ms) 200 'key=***-bfd7-daaecea58804&hostname=host-1&registry_hostname=host-1&machine_guid=***-a68a-0aebf968c7ca&update_every=1&os=linux&timezone=UTC&tags=&ver=3&NETDATA_SYSTEM_OS_NAME=unknown&NETDATA_SYSTEM_OS_ID=unknown&NETDATA_SYSTEM_OS_ID_LIKE=unknown&NETDATA_SYSTEM_OS_VERSION=unknown&NETDATA_SYSTEM_OS_VERSION_ID=unknown&NETDATA_SYSTEM_OS_DETECTION=unknown&NETDATA_HOST_IS_K8S_NODE=true&NETDATA_SYSTEM_KERNEL_NAME=Linux&NETDATA_SYSTEM_KERNEL_VERSION=5.8.0-1041-aws&NETDATA_SYSTEM_ARCHITECTURE=x86_64&NETDATA_SYSTEM_VIRTUALIZATION=hypervisor&NETDATA_SYSTEM_VIRT_DETECTION=/proc/cpuinfo&NETDATA_SYSTEM_CONTAINER=docker&NETDATA_SYSTEM_CONTAINER_DETECTION=dockerenv&NETDATA_CONTAINER_OS_NAME=Alpine Linux&NETDATA_CONTAINER_OS_ID=alpine&NETDATA_CONTAINER_OS_ID_LIKE=unknown&NETDATA_CONTAINER_OS_VERSION=unknown&NETDATA_CONTAINER_OS_VERSION_ID=3.12.7&NETDATA_CONTAINER_OS_DETECTION=&NETDATA_SYSTEM_CPU_LOGICAL_CPU_COUNT=2&NETDATA_SYSTEM_CPU_FREQ=2399000000&NETDATA_SYSTEM_TOTAL_RAM=4123906048&NETDATA_SYSTEM_TOTAL_DISK_SIZE=53687091200&NETDATA_PROTOCOL_VERSION=1.1'



child:
2021-08-02 08:29:06: netdata INFO  : MAIN : NETDATA_SYSTEM_VIRT_DETECTION=/proc/cpuinfo
2021-08-02 08:29:07: netdata INFO  : PLUGIN[proc] : thread created with task id 8822
2021-08-02 08:29:07: netdata INFO  : PLUGIN[proc] : set name of thread 8822 to PLUGIN[proc]
╭─dongho@host-1 in ~/***/netdata on main ✘ (origin/main)
╰$ k get po
NAME                             READY   STATUS    RESTARTS   AGE
netdata-child-xtrvz              2/2     Running   0          3h32m
netdata-parent-fb488754b-xhjrh   1/1     Running   0          3h32m
╭─dongho@host-1 in ~/***/netdata on main ✘ (origin/main)
╰$ k get po -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE              NOMINATED NODE   READINESS GATES
netdata-child-xtrvz              2/2     Running   0          3h34m   172.31.32.52   ip-172-31-32-52   <none>           <none>
netdata-parent-fb488754b-xhjrh   1/1     Running   0          3h34m   172.31.32.52   ip-172-31-32-52   <none>           <none>
╭─dongho@host-1 in ~/***/netdata on main ✘ (origin/main)
╰$ k get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
netdata-child   1         1         1       1            1           <none>          2d6h
╭─dongho@host-1 in ~/***/netdata on main ✘ (origin/main)
╰$ k get deploy
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
netdata-parent   1/1     1            1           2d6h
╭─dongho@host-1 in ~/***/netdata on main ✘ (origin/main)
╰$

Ah, I remember. I deployed the netdata helm chart on a single EC2 instance without any load balancer. So I edited values.yaml to disable the ingress, added hostNetwork to deployment.yaml, and added an externalIP to service.yaml.

Maybe that's the cause..? Then what should I do to deploy the netdata helm chart on a single EC2 instance without a load balancer?

diff --git a/helm_netdata/charts/netdata/templates/deployment.yaml b/my_helm_netdata/netdata/chart/templates/deployment.yaml
index 48b8150..07e1782 100644
--- a/helm_netdata/charts/netdata/templates/deployment.yaml
+++ b/my_helm_netdata/netdata/chart/templates/deployment.yaml
@@ -31,6 +31,7 @@ spec:
 {{ toYaml . | trim | indent 8 }}
 {{- end }}
     spec:
+      hostNetwork: true
       securityContext:
         fsGroup: 201
       serviceAccountName: {{ .Values.serviceAccount.name }}
diff --git a/helm_netdata/charts/netdata/templates/service.yaml b/my_helm_netdata/netdata/chart/templates/service.yaml
index 81ee381..f9be640 100644
--- a/helm_netdata/charts/netdata/templates/service.yaml
+++ b/my_helm_netdata/netdata/chart/templates/service.yaml
@@ -32,6 +32,9 @@ spec:
   {{- if and (eq .Values.service.type "ClusterIP") .Values.service.clusterIP }}
   clusterIP: {{ .Values.service.clusterIP }}
   {{- end }}
+  externalIPs:
+    - *.*.*.236
   ports:
     - port: {{ .Values.service.port }}
       targetPort: http
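As an aside, instead of patching the templates, it might be possible to get the same result from values.yaml alone, since the chart already references `.Values.service.type` (visible in the diff above). This is only a sketch with assumed keys; check the chart's own values.yaml for the exact names:

```yaml
# values.yaml sketch -- assumed keys, not verified against every chart version
ingress:
  enabled: false          # no ingress controller / load balancer in this setup
service:
  type: NodePort          # expose the parent dashboard on the node's IP directly
  port: 19999
```

With a NodePort service, the dashboard would be reachable on the EC2 instance's IP at the allocated node port, without hostNetwork or externalIPs.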

@0xF4D3C0D3 perhaps the problem is the identical hostnames of the parent and the child instance. That is why when you access host/host-1 you get the parent's metrics 🤔

I have only one instance, so the parent and child pods are on the same node. In addition, I edited some templates as above so that I can deploy the netdata helm chart on a single EC2 instance without a load balancer. If that's the cause, what should I do instead..?

I don't think that the lack of a load balancer or having only one node is a problem.

Let's check the parent's /api/v1/info response. Check the mirrored_hosts_status list, and try to access the child instance using its GUID (IP:PORT/host/GUID).
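The check above can be sketched as follows (`PARENT` is a placeholder, the `child_url` helper is hypothetical, and `jq` is assumed to be available):

```shell
# List the GUIDs of the hosts mirrored by the parent:
#   curl -s "http://PARENT:19999/api/v1/info" | jq -r '.mirrored_hosts_status[].guid'

# Hypothetical helper: build the dashboard URL for a child instance,
# given the parent's IP:PORT and the child's GUID.
child_url() {
  local parent="$1" guid="$2"
  echo "http://${parent}/host/${guid}"
}

child_url "192.0.2.10:19999" "002d2578-dcb6-11eb-a68a-0aebf968c7ca"
# prints http://192.0.2.10:19999/host/002d2578-dcb6-11eb-a68a-0aebf968c7ca
```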

So you have the same hostname for both instances. Can you

  • kubectl exec PARENT_POD -- cat /etc/hostname
  • kubectl exec CHILD_POD -c netdata -- cat /etc/hostname

Related k8s documentation section.

{
	"version": "v1.31.0",
	"uid": "27df32ec-f1be-11eb-aef9-0aebf968c7ca",
	"mirrored_hosts": [
		"host-1",
		"host-1"
	],
	"mirrored_hosts_status": [
		{ "guid": "27df32ec-f1be-11eb-aef9-0aebf968c7ca", "reachable": true, "claim_id": null },
		{ "guid": "002d2578-dcb6-11eb-a68a-0aebf968c7ca", "reachable": true, "claim_id": null }
	],
	"alarms": {
		"normal": 0,
		"warning": 0,
		"critical": 0
	},
	"os_name": "unknown",
	"os_id": "unknown",
	"os_id_like": "unknown",
	"os_version": "unknown",
	"os_version_id": "unknown",
	"os_detection": "unknown",
	"cores_total": "2",
	"total_disk_space": "53687091200",
	"cpu_freq": "2399000000",
	"ram_total": "4123906048",
	"container_os_name": "Alpine Linux",
	"container_os_id": "alpine",
	"container_os_id_like": "unknown",
	"container_os_version": "unknown",
	"container_os_version_id": "3.12.7",
	"is_k8s_node": "true",
	"kernel_name": "Linux",
	"kernel_version": "5.8.0-1041-aws",
	"architecture": "x86_64",
	"virtualization": "hypervisor",
	"virt_detection": "/proc/cpuinfo",
	"container": "docker",
	"container_detection": "dockerenv",
	"host_labels": {
		"_os_name": "unknown",
		"_os_version": "unknown",
		"_kernel_version": "5.8.0-1041-aws",
		"_system_cores": "2",
		"_system_cpu_freq": "2399000000",
		"_system_ram_total": "4123906048",
		"_system_disk_space": "53687091200",
		"_architecture": "x86_64",
		"_virtualization": "hypervisor",
		"_container": "docker",
		"_container_detection": "dockerenv",
		"_virt_detection": "/proc/cpuinfo",
		"_is_k8s_node": "true",
		"_aclk_impl": "Legacy",
		"_aclk_proxy": "none",
		"_is_parent": "true",
		"k8s_cluster_id": "***-4c3c-8a7c-e8dc67aa5045",
		"role": "parent",
		"release": "netdata",
		"pod-template-hash": "7f698c5d5b",
		"app": "netdata"
	},
	"collectors": [
		{
			"plugin": "timex.plugin",
			"module": ""
		},
		{
			"plugin": "statsd.plugin",
			"module": "stats"
		},
		{
			"plugin": "netdata",
			"module": "stats"
		},
		{
			"plugin": "web",
			"module": "stats"
		}
	],
	"cloud-enabled": true,
	"cloud-available": true,
	"aclk-implementation": "legacy",
	"agent-claimed": false,
	"aclk-available": false,
	"memory-mode": "save",
	"multidb-disk-quota": 256,
	"page-cache-size": 32,
	"stream-enabled": false,
	"hosts-available": null,
	"https-enabled": true,
	"buildinfo": "dbengine|Native HTTPS|Netdata Cloud|TLS Host Verification|JSON-C|libcrypto|libm|LWS v3.2.2|mosquitto|zlib|apps|cgroup Network Tracking|IPMI|perf|slabinfo|MongoDB|Prometheus Remote Write",
	"release-channel": "nightly",
	"web-enabled": true,
	"notification-methods": null,
	"exporting-enabled": false,
	"exporting-connectors": null,
	"allmetrics-prometheus-used": null,
	"allmetrics-shell-used": null,
	"allmetrics-json-used": null,
	"dashboard-used": 4,
	"charts-count": null,
	"metrics-count": null
}

Yay! Progress!

002d2578-dcb6-11eb-a68a-0aebf968c7ca is my child's GUID.
and then voila!
(screenshot)

If I access the child's dashboard with its GUID, the dashboard is fine :D

But that doesn't mean the problem is gone..

and the hostnames are as follows:

dongho:netdata/ (main✗) $ ./scripts/exec.sh child                                                                                                      [12:41:12]
Defaulted container "netdata" out of: netdata, sd, init-sysctl (init), init-persistence (init)
bash-5.0# cat /etc/hostname
host-1
bash-5.0# exit
dongho:netdata/ (main✗) $ ./scripts/exec.sh parent                                                                                                     [12:41:22]
Defaulted container "netdata" out of: netdata, init-sysctl (init)
bash-5.0# cat /etc/hostname
master-dashboard
bash-5.0#

but according to here

I can't change the pod's hostname while using hostNetwork :(

Do you know a way to deploy the netdata helm chart without a load balancer and hostNetwork..?

First of all, I really appreciate you @ilyam8

you made my day! :D

(screenshot)

I had to add hostNetwork only to deployment.yaml for the parent, not to daemonset.yaml for the child. So I got rid of it from daemonset.yaml and, tada!
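Sketched as a revert of the earlier patch (assuming daemonset.yaml was modified the same way as deployment.yaml; the path is hypothetical):

```diff
--- a/my_helm_netdata/netdata/chart/templates/daemonset.yaml
+++ b/my_helm_netdata/netdata/chart/templates/daemonset.yaml
     spec:
-      hostNetwork: true
       securityContext:
         fsGroup: 201
```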

The child node now gets its pod name as its hostname, so the issue is gone!

really really thank you!

It is nice you got it working 🎉