netdata/netdata

[Bug]: Clickhouse autodiscovery seems broken

artiommocrenco opened this issue · 7 comments

Bug description

Autodiscovery for the ClickHouse Prometheus exporter stopped working after a reboot, possibly due to an upgrade.

Expected behavior

clickhouse_local metrics are collected and displayed in the Cloud dashboard

Steps to reproduce

Unknown

Installation method

kickstart.sh

System info

Ubuntu 22.04.4 LTS

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.45.0-190-nightly
    Installation Type __________________________________________ : binpkg-deb
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ :  
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /var/lib/netdata/www
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.15.0-102-generic
    Operating System ___________________________________________ : Ubuntu
    Operating System ID ________________________________________ : ubuntu
    Operating System ID Like ___________________________________ : debian
    Operating System Version ___________________________________ : 22.04.4 LTS (Jammy Jellyfish)
    Operating System Version ID ________________________________ : none
    Detection __________________________________________________ : /etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 32
    CPU Frequency ______________________________________________ : 5758000000
    RAM Bytes __________________________________________________ : 134733611008
    Disk Capacity ______________________________________________ : 19203769073664
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : none
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : none
    Container Operating System ID ______________________________ : none
    Container Operating System ID Like _________________________ : none
    Container Operating System Version _________________________ : none
    Container Operating System Version ID ______________________ : none
    Container Operating System Detection _______________________ : none
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine (compression) _____________________________________ : YES (zstd lz4)
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : NO
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : YES
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : YES
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

This repo is archived starting Mar 27, 2024

In it, I found mentions of clickhouse_local here, but in the netdata/netdata repo I can see no mention of clickhouse_local.

I switched the Netdata apt source from edge to stable and upgraded to 1.45.3, observing the same issue; I then downgraded to 1.45.0, and it still doesn't work.

The ClickHouse exporter is reachable at http://127.0.0.1:9363/metrics, as before.

Hi, @artiommocrenco. Is ClickHouse installed on the host or as a Docker container?

And please show the following:

# as root
/usr/libexec/netdata/plugins.d/local-listeners no-udp6 no-local no-inbound no-outbound no-namespaces

@ilyam8 thank you, it's installed as an apt package on the host

# /usr/libexec/netdata/plugins.d/local-listeners no-udp6 no-local no-inbound no-outbound no-namespaces
UDP|127.0.0.1|8125|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|127.0.0.1|8125|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|0.0.0.0|19999|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP6|::|19999|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|127.0.0.1|8123|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9000|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9004|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9005|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9363|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9009|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|0.0.0.0|22|sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups
TCP6|::|22|sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups
UDP|127.0.0.53|53|/lib/systemd/systemd-resolved
TCP|127.0.0.53|53|/lib/systemd/systemd-resolved
# curl http://127.0.0.1:9363/metrics | grep Uptime
# HELP ClickHouseAsyncMetrics_Uptime The server uptime in seconds. It includes the time spent for server initialization before accepting connections.
# TYPE ClickHouseAsyncMetrics_Uptime gauge
ClickHouseAsyncMetrics_Uptime 5675.772332555
# HELP ClickHouseAsyncMetrics_OSUptime The uptime of the host server (the machine where ClickHouse is running), in seconds.
# TYPE ClickHouseAsyncMetrics_OSUptime gauge
ClickHouseAsyncMetrics_OSUptime 5697.54
# curl localhost:19999/api/v1/allmetrics?format=prometheus_all_hosts | grep ClickHouseAsyncMetrics_Uptime
(no matching output: the metric is absent from Netdata's exported metrics; curl progress meter omitted)

@artiommocrenco I am pretty sure the data collection job was created

journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep click

You should see it in the UI: Nodes -> ⚙️ -> Collectors, search: prom

[Screenshot 2024-04-16 at 16:47:20]

Related post in GH discussions #17170 (comment)

@ilyam8 Ah! Now I know how to debug this! Thank you so much for your help.

So now I know what the issue is:

Job restart failed: 'http://127.0.0.1:9363/metrics' num of time series (2071) > limit (2000)


It is fixed by adding this to /etc/netdata/go.d/prometheus.conf:

jobs:
  - name: clickhouse_local
    url: http://127.0.0.1:9363/metrics
    max_time_series: 3000

Apparently, after the ClickHouse server upgrade the number of series increased and collection started failing. This might qualify as a bug in Netdata, so that the default max_time_series is also increased on your side.
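As a quick sanity check, the number of exposed time series can be counted directly from the exporter output: in the Prometheus text format, every non-empty line that does not start with '#' is one sample. A minimal sketch using an inline sample (against the live exporter you would pipe curl output into the same grep):

```shell
#!/bin/sh
# Count Prometheus time series in exporter output.
# Lines starting with '#' are HELP/TYPE comments; every other line is
# one sample. Against a live exporter this would be:
#   curl -s http://127.0.0.1:9363/metrics | grep -cv '^#'
sample='# HELP ClickHouseAsyncMetrics_Uptime The server uptime in seconds.
# TYPE ClickHouseAsyncMetrics_Uptime gauge
ClickHouseAsyncMetrics_Uptime 5675.77
# HELP ClickHouseAsyncMetrics_OSUptime Host uptime in seconds.
# TYPE ClickHouseAsyncMetrics_OSUptime gauge
ClickHouseAsyncMetrics_OSUptime 5697.54'

printf '%s\n' "$sample" | grep -cv '^#'
```

If the count exceeds the job's max_time_series (2000 in the error above), the job fails exactly as shown, so raising the limit in prometheus.conf fixes it.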

Thank you 🥇

I see you have - name: clickhouse_local, but it is just clickhouse in the screenshot. Did you change it afterwards?

Thanks for debugging the problem 👍 I updated max_time_series to 3000 for clickhouse in #17415. It will work in the next nightly version. Make sure the name of the one in /etc/netdata/go.d/prometheus.conf is clickhouse_local, otherwise you will have 2 data collection jobs scraping ClickHouse.

@ilyam8 That is because I was testing; yes, I changed it. Now it's only clickhouse_local, after we got it working.

Thanks again 👍