[Bug]: Clickhouse autodiscovery seems broken
artiommocrenco opened this issue · 7 comments
Bug description
Autodiscovery for the ClickHouse Prometheus exporter stopped working after a reboot, possibly due to an upgrade
Expected behavior
clickhouse_local metrics are collected and displayed in the Netdata Cloud dashboard
Steps to reproduce
Unknown
Installation method
kickstart.sh
System info
Ubuntu 22.04.4 LTS
Netdata build info
Packaging:
Netdata Version ____________________________________________ : v1.45.0-190-nightly
Installation Type __________________________________________ : binpkg-deb
Package Architecture _______________________________________ : x86_64
Package Distro _____________________________________________ :
Configure Options __________________________________________ : dummy-configure-command
Default Directories:
User Configurations ________________________________________ : /etc/netdata
Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
Permanent Databases ________________________________________ : /var/lib/netdata
Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
Static Web Files ___________________________________________ : /var/lib/netdata/www
Log Files __________________________________________________ : /var/log/netdata
Lock Files _________________________________________________ : /var/lib/netdata/lock
Home _______________________________________________________ : /var/lib/netdata
Operating System:
Kernel _____________________________________________________ : Linux
Kernel Version _____________________________________________ : 5.15.0-102-generic
Operating System ___________________________________________ : Ubuntu
Operating System ID ________________________________________ : ubuntu
Operating System ID Like ___________________________________ : debian
Operating System Version ___________________________________ : 22.04.4 LTS (Jammy Jellyfish)
Operating System Version ID ________________________________ : none
Detection __________________________________________________ : /etc/os-release
Hardware:
CPU Cores __________________________________________________ : 32
CPU Frequency ______________________________________________ : 5758000000
RAM Bytes __________________________________________________ : 134733611008
Disk Capacity ______________________________________________ : 19203769073664
CPU Architecture ___________________________________________ : x86_64
Virtualization Technology __________________________________ : none
Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
Container __________________________________________________ : none
Container Detection ________________________________________ : systemd-detect-virt
Container Orchestrator _____________________________________ : none
Container Operating System _________________________________ : none
Container Operating System ID ______________________________ : none
Container Operating System ID Like _________________________ : none
Container Operating System Version _________________________ : none
Container Operating System Version ID ______________________ : none
Container Operating System Detection _______________________ : none
Features:
Built For __________________________________________________ : Linux
Netdata Cloud ______________________________________________ : YES
Health (trigger alerts and send notifications) _____________ : YES
Streaming (stream metrics to parent Netdata servers) _______ : YES
Back-filling (of higher database tiers) ____________________ : YES
Replication (fill the gaps of parent Netdata servers) ______ : YES
Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
Contexts (index all active and archived metrics) ___________ : YES
Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
Machine Learning ___________________________________________ : YES
Database Engines:
dbengine (compression) _____________________________________ : YES (zstd lz4)
alloc ______________________________________________________ : YES
ram ________________________________________________________ : YES
none _______________________________________________________ : YES
Connectivity Capabilities:
ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
static (Netdata internal web server) _______________________ : YES
h2o (web server) ___________________________________________ : YES
WebRTC (experimental) ______________________________________ : NO
Native HTTPS (TLS Support) _________________________________ : YES
TLS Host Verification ______________________________________ : YES
Libraries:
LZ4 (extremely fast lossless compression algorithm) ________ : YES
ZSTD (fast, lossless compression algorithm) ________________ : YES
zlib (lossless data-compression library) ___________________ : YES
Brotli (generic-purpose lossless compression algorithm) ____ : NO
protobuf (platform-neutral data serialization protocol) ____ : YES (system)
OpenSSL (cryptography) _____________________________________ : YES
libdatachannel (stand-alone WebRTC data channels) __________ : NO
JSON-C (lightweight JSON manipulation) _____________________ : YES
libcap (Linux capabilities system operations) ______________ : NO
libcrypto (cryptographic functions) ________________________ : YES
libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
apps (monitor processes) ___________________________________ : YES
cgroups (monitor containers and VMs) _______________________ : YES
cgroup-network (associate interfaces to CGROUPS) ___________ : YES
proc (monitor Linux systems) _______________________________ : YES
tc (monitor Linux network QoS) _____________________________ : YES
diskspace (monitor Linux mount points) _____________________ : YES
freebsd (monitor FreeBSD systems) __________________________ : NO
macos (monitor MacOS systems) ______________________________ : NO
statsd (collect custom application metrics) ________________ : YES
timex (check system clock synchronization) _________________ : YES
idlejitter (check system latency and jitter) _______________ : YES
bash (support shell data collection jobs - charts.d) _______ : YES
debugfs (kernel debugging metrics) _________________________ : YES
cups (monitor printers and print jobs) _____________________ : YES
ebpf (monitor system calls) ________________________________ : YES
freeipmi (monitor enterprise server H/W) ___________________ : YES
nfacct (gather netfilter accounting) _______________________ : YES
perf (collect kernel performance events) ___________________ : YES
slabinfo (monitor kernel object caching) ___________________ : YES
Xen ________________________________________________________ : YES
Xen VBD Error Tracking _____________________________________ : NO
Logs Management ____________________________________________ : YES
Exporters:
AWS Kinesis ________________________________________________ : NO
GCP PubSub _________________________________________________ : NO
MongoDB ____________________________________________________ : YES
Prometheus (OpenMetrics) Exporter __________________________ : YES
Prometheus Remote Write ____________________________________ : YES
Graphite ___________________________________________________ : YES
Graphite HTTP / HTTPS ______________________________________ : YES
JSON _______________________________________________________ : YES
JSON HTTP / HTTPS __________________________________________ : YES
OpenTSDB ___________________________________________________ : YES
OpenTSDB HTTP / HTTPS ______________________________________ : YES
All Metrics API ____________________________________________ : YES
Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
Trace All Netdata Allocations (with charts) ________________ : NO
Developer Mode (more runtime checks, slower) _______________ : NO
Additional info
In a repo archived on Mar 27, 2024, I found mentions of clickhouse_local here, but in the netdata/netdata repo I can see no mention of clickhouse_local.
I switched the netdata apt source from edge to stable and upgraded to 1.45.3, observing the same issue; I then downgraded to 1.45.0 and it still doesn't work.
The clickhouse exporter is reachable at http://127.0.0.1:9363/metrics, as before.
Hi, @artiommocrenco. Is ClickHouse installed on the host or as a Docker container?
And please show the following:
# as root
/usr/libexec/netdata/plugins.d/local-listeners no-udp6 no-local no-inbound no-outbound no-namespaces
@ilyam8 thank you, it's installed as an apt package on the host
# /usr/libexec/netdata/plugins.d/local-listeners no-udp6 no-local no-inbound no-outbound no-namespaces
UDP|127.0.0.1|8125|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|127.0.0.1|8125|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|0.0.0.0|19999|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP6|::|19999|/usr/sbin/netdata -P /run/netdata/netdata.pid -D
TCP|127.0.0.1|8123|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9000|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9004|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9005|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9363|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|127.0.0.1|9009|/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
TCP|0.0.0.0|22|sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups
TCP6|::|22|sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups
UDP|127.0.0.53|53|/lib/systemd/systemd-resolved
TCP|127.0.0.53|53|/lib/systemd/systemd-resolved
# curl http://127.0.0.1:9363/metrics | grep Uptime
# HELP ClickHouseAsyncMetrics_Uptime The server uptime in seconds. It includes the time spent for server initialization before accepting connections.
# TYPE ClickHouseAsyncMetrics_Uptime gauge
ClickHouseAsyncMetrics_Uptime 5675.772332555
# HELP ClickHouseAsyncMetrics_OSUptime The uptime of the host server (the machine where ClickHouse is running), in seconds.
# TYPE ClickHouseAsyncMetrics_OSUptime gauge
ClickHouseAsyncMetrics_OSUptime 5697.54
# curl localhost:19999/api/v1/allmetrics?format=prometheus_all_hosts | grep ClickHouseAsyncMetrics_Uptime
(no matching output: Netdata is not exporting the ClickHouse metrics)
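As an aside: a quick way to check whether a scrape contains a given metric is curl -s (silent, no progress meter) piped into grep -c. A minimal sketch, run against sample exposition text so it is self-contained; the sample values mirror the exporter output above:

```shell
# Sample Prometheus exposition text standing in for a live scrape of
# http://127.0.0.1:9363/metrics; with the real exporter you would run:
#   curl -s http://127.0.0.1:9363/metrics | grep -c '^ClickHouseAsyncMetrics_Uptime'
sample='# TYPE ClickHouseAsyncMetrics_Uptime gauge
ClickHouseAsyncMetrics_Uptime 5675.77
# TYPE ClickHouseAsyncMetrics_OSUptime gauge
ClickHouseAsyncMetrics_OSUptime 5697.54'

# Count lines that start with the exact metric name; HELP/TYPE comment
# lines and the OSUptime series do not match the anchored pattern
echo "$sample" | grep -c '^ClickHouseAsyncMetrics_Uptime '
# → 1
```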
@artiommocrenco I am pretty sure the data collection job was created. Please check the logs:
journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep click
You should see it in the UI: Nodes -> ⚙️ -> Collectors, search: prom
Related post in GH discussions #17170 (comment)
@ilyam8 Ah! Now I know how to debug this! Thank you so much for your help
So now I know what the issue is:
Job restart failed: 'http://127.0.0.1:9363/metrics' num of time series (2071) > limit (2000)
It is fixed by having this prometheus.conf:
jobs:
- name: clickhouse_local
url: http://127.0.0.1:9363/metrics
max_time_series: 3000
Apparently, after a ClickHouse server upgrade the number of exported time series increased and collection started failing. This might qualify as a bug in netdata; consider increasing max_time_series on your side as well.
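For reference, one way to see how many time series the exporter exposes (and so whether max_time_series needs raising) is to count the non-comment lines of the scrape, since each such line is one sample in the Prometheus exposition format. A minimal sketch on sample text; against the live exporter you would pipe curl -s http://127.0.0.1:9363/metrics into the same grep:

```shell
# Each non-comment line in Prometheus exposition format is one time series
# sample; '# HELP' and '# TYPE' lines are metadata. Sample text stands in
# for a live scrape of http://127.0.0.1:9363/metrics:
sample='# HELP ClickHouseAsyncMetrics_Uptime The server uptime in seconds.
# TYPE ClickHouseAsyncMetrics_Uptime gauge
ClickHouseAsyncMetrics_Uptime 5675.77
ClickHouseAsyncMetrics_OSUptime 5697.54
ClickHouseProfileEvents_Query 12345'

# Count lines whose first character is not '#' (blank lines are also
# excluded, since '^[^#]' requires at least one character)
echo "$sample" | grep -c '^[^#]'
# → 3
```

If the count exceeds the job's limit (2000 here, per the error message above), the collector rejects the scrape.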
Thank you!
I see you have - name: clickhouse_local, but it is just clickhouse in the screenshot? Did you change it afterwards?
Thanks for debugging the problem! I updated max_time_series to 3000 for clickhouse in #17415; it will be in the next nightly version. Make sure the name of the job in /etc/netdata/go.d/prometheus.conf is clickhouse_local, otherwise you will have 2 data collection jobs scraping ClickHouse.
@ilyam8 that was because I was testing; yes, I did change it. Now it's only clickhouse_local, after we got it working.
Thanks again!