netdata/netdata

[Bug]: Clickhouse autodiscovery seems broken

artiommocrenco opened this issue · 7 comments

Bug description

Autodiscovery for the ClickHouse Prometheus exporter stopped working after a reboot, possibly due to an upgrade.

Expected behavior

clickhouse_local metrics are collected and displayed in the Cloud dashboard

Steps to reproduce

Unknown

Installation method

kickstart.sh

System info

Ubuntu 22.04.4 LTS

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.45.0-190-nightly
    Installation Type __________________________________________ : binpkg-deb
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ :  
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /var/lib/netdata/www
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.15.0-102-generic
    Operating System ___________________________________________ : Ubuntu
    Operating System ID ________________________________________ : ubuntu
    Operating System ID Like ___________________________________ : debian
    Operating System Version ___________________________________ : 22.04.4 LTS (Jammy Jellyfish)
    Operating System Version ID ________________________________ : none
    Detection __________________________________________________ : /etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 32
    CPU Frequency ______________________________________________ : 5758000000
    RAM Bytes __________________________________________________ : 134733611008
    Disk Capacity ______________________________________________ : 19203769073664
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : none
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : none
    Container Operating System ID ______________________________ : none
    Container Operating System ID Like _________________________ : none
    Container Operating System Version _________________________ : none
    Container Operating System Version ID ______________________ : none
    Container Operating System Detection _______________________ : none
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine (compression) _____________________________________ : YES (zstd lz4)
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : NO
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : YES
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : YES
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

This repo is archived starting Mar 27, 2024

In it, I found mentions of clickhouse_local here, but in the netdata/netdata repo I can see no mention of clickhouse_local.

I switched the Netdata apt source from edge to stable and upgraded to 1.45.3, observing the same issue; I then downgraded to 1.45.0, and it still doesn't work.

The ClickHouse exporter is reachable at http://127.0.0.1:9363/metrics, as before.

Hi, @artiommocrenco. Is ClickHouse installed on the host or as a Docker container?

And please show the following:

# as root
/usr/libexec/netdata/plugins.d/local-listeners no-udp6 no-local no-inbound no-outbound no-namespaces

@ilyam8 thank you, it's installed as an apt package on the host

# /usr/libexec/netdata/plugins.d/local-listeners no-udp6 no-local no-inbound no-outbound no-namespaces
UDP|127.0.0.1|8125|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|127.0.0.1|8125|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|0.0.0.0|19999|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP6|::|19999|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|127.0.0.1|8123|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9000|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9004|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9005|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9363|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9009|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|0.0.0.0|22|sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups
TCP6|::|22|sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups
UDP|127.0.0.53|53|/lib/systemd/systemd-resolved
TCP|127.0.0.53|53|/lib/systemd/systemd-resolved
# curl http://127.0.0.1:9363/metrics | grep Uptime
# HELP ClickHouseAsyncMetrics_Uptime The server uptime in seconds. It includes the time spent for server initialization before accepting connections.
# TYPE ClickHouseAsyncMetrics_Uptime gauge
ClickHouseAsyncMetrics_Uptime 5675.772332555
# HELP ClickHouseAsyncMetrics_OSUptime The uptime of the host server (the machine where ClickHouse is running), in seconds.
# TYPE ClickHouseAsyncMetrics_OSUptime gauge
ClickHouseAsyncMetrics_OSUptime 5697.54
# curl localhost:19999/api/v1/allmetrics?format=prometheus_all_hosts | grep ClickHouseAsyncMetrics_Uptime
(no matching output: the metric is absent from Netdata's exported metrics; curl progress meter omitted)

@artiommocrenco I am pretty sure the data collection job was created

journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep click

You should see it in the UI: Nodes -> ⚙️ -> Collectors, search: prom

[Screenshot 2024-04-16 at 16:47:20]

Related post in GH discussions #17170 (comment)

@ilyam8 Ah! Now I know how to debug this! Thank you so much for your help.

So now I know what the issue is:

Job restart failed: 'http://127.0.0.1:9363/metrics' num of time series (2071) > limit (2000)


It is fixed by adding this to /etc/netdata/go.d/prometheus.conf:

jobs:
  - name: clickhouse_local
    url: http://127.0.0.1:9363/metrics
    max_time_series: 3000

Apparently, after the ClickHouse server upgrade the number of series increased and collection started failing. This might qualify as a bug in Netdata, so that the default max_time_series is also increased on your side.
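As a quick sanity check, the number of exposed time series can be counted directly from the exporter output: in the Prometheus text format, every non-empty line that does not start with '#' is one sample. A minimal sketch using an inline sample (against the live exporter you would pipe curl output into the same grep):

```shell
#!/bin/sh
# Count Prometheus time series in exporter output.
# Lines starting with '#' are HELP/TYPE comments; every other line is
# one sample. Against a live exporter this would be:
#   curl -s http://127.0.0.1:9363/metrics | grep -cv '^#'
sample='# HELP ClickHouseAsyncMetrics_Uptime The server uptime in seconds.
# TYPE ClickHouseAsyncMetrics_Uptime gauge
ClickHouseAsyncMetrics_Uptime 5675.77
# HELP ClickHouseAsyncMetrics_OSUptime Host uptime in seconds.
# TYPE ClickHouseAsyncMetrics_OSUptime gauge
ClickHouseAsyncMetrics_OSUptime 5697.54'

printf '%s\n' "$sample" | grep -cv '^#'
```

If the count exceeds the job's max_time_series (2000 in the error above), the job fails exactly as shown, so raising the limit in prometheus.conf fixes it.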

Thank you 🥇

I see you have - name: clickhouse_local, but it is just clickhouse in the screenshot. Did you change it afterwards?

Thanks for debugging the problem 👍 I updated max_time_series to 3000 for clickhouse in #17415. It will work in the next nightly version. Make sure the name of the one in /etc/netdata/go.d/prometheus.conf is clickhouse_local, otherwise you will have 2 data collection jobs scraping ClickHouse.

@ilyam8 That is because I was testing; yes, I changed it. Now it's only clickhouse_local, after we got it working.

Thanks again 👍