[Bug]: Incorrect bloat calculation in go.d.plugin, postgres module
cuu508 opened this issue · 6 comments
Bug description
I'm monitoring a postgres database with netdata. Netdata reports a particular index as having high bloat (above 70%, which triggers a warning, which is how I noticed). Looking into it, it appears as if netdata is stuck reporting a high value:
If I manually run the query from
the query output, the problematic index does not show up in the query results at all.If I check index bloat using another query from here, it reports index bloat as 21% for the specific index.
I suspect that:
- the bloat calculation query is missing data for some of the indexes
- if a row from the bloat calculation query disappears, netdata does not handle it correctly (in the graphs, and for the alerting, it keeps using the last known value)
Expected behavior
Steps to reproduce
...
Installation method
manual setup of official DEB/RPM packages
System info
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=22.04
/etc/lsb-release:DISTRIB_CODENAME=jammy
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
/etc/os-release:PRETTY_NAME="Ubuntu 22.04.4 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION_ID="22.04"
/etc/os-release:VERSION="22.04.4 LTS (Jammy Jellyfish)"
/etc/os-release:VERSION_CODENAME=jammy
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:UBUNTU_CODENAME=jammy
Netdata build info
Packaging:
Netdata Version ____________________________________________ : v1.45.3
Installation Type __________________________________________ : binpkg-deb
Package Architecture _______________________________________ : x86_64
Package Distro _____________________________________________ :
Configure Options __________________________________________ : dummy-configure-command
Default Directories:
User Configurations ________________________________________ : /etc/netdata
Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
Permanent Databases ________________________________________ : /var/lib/netdata
Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
Static Web Files ___________________________________________ : /var/lib/netdata/www
Log Files __________________________________________________ : /var/log/netdata
Lock Files _________________________________________________ : /var/lib/netdata/lock
Home _______________________________________________________ : /var/lib/netdata
Operating System:
Kernel _____________________________________________________ : Linux
Kernel Version _____________________________________________ : 6.5.0-26-generic
Operating System ___________________________________________ : Ubuntu
Operating System ID ________________________________________ : ubuntu
Operating System ID Like ___________________________________ : debian
Operating System Version ___________________________________ : 22.04.4 LTS (Jammy Jellyfish)
Operating System Version ID ________________________________ : none
Detection __________________________________________________ : /etc/os-release
Hardware:
CPU Cores __________________________________________________ : 32
CPU Frequency ______________________________________________ : 5083000000
RAM Bytes __________________________________________________ : 134962692096
Disk Capacity ______________________________________________ : 7681511964672
CPU Architecture ___________________________________________ : x86_64
Virtualization Technology __________________________________ : none
Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
Container __________________________________________________ : none
Container Detection ________________________________________ : systemd-detect-virt
Container Orchestrator _____________________________________ : none
Container Operating System _________________________________ : none
Container Operating System ID ______________________________ : none
Container Operating System ID Like _________________________ : none
Container Operating System Version _________________________ : none
Container Operating System Version ID ______________________ : none
Container Operating System Detection _______________________ : none
Features:
Built For __________________________________________________ : Linux
Netdata Cloud ______________________________________________ : YES
Health (trigger alerts and send notifications) _____________ : YES
Streaming (stream metrics to parent Netdata servers) _______ : YES
Back-filling (of higher database tiers) ____________________ : YES
Replication (fill the gaps of parent Netdata servers) ______ : YES
Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
Contexts (index all active and archived metrics) ___________ : YES
Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
Machine Learning ___________________________________________ : YES
Database Engines:
dbengine ___________________________________________________ : YES
alloc ______________________________________________________ : YES
ram ________________________________________________________ : YES
none _______________________________________________________ : YES
Connectivity Capabilities:
ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
static (Netdata internal web server) _______________________ : YES
h2o (web server) ___________________________________________ : YES
WebRTC (experimental) ______________________________________ : NO
Native HTTPS (TLS Support) _________________________________ : YES
TLS Host Verification ______________________________________ : YES
Libraries:
LZ4 (extremely fast lossless compression algorithm) ________ : YES
ZSTD (fast, lossless compression algorithm) ________________ : YES
zlib (lossless data-compression library) ___________________ : YES
Brotli (generic-purpose lossless compression algorithm) ____ : NO
protobuf (platform-neutral data serialization protocol) ____ : YES (system)
OpenSSL (cryptography) _____________________________________ : YES
libdatachannel (stand-alone WebRTC data channels) __________ : NO
JSON-C (lightweight JSON manipulation) _____________________ : YES
libcap (Linux capabilities system operations) ______________ : NO
libcrypto (cryptographic functions) ________________________ : YES
libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
apps (monitor processes) ___________________________________ : YES
cgroups (monitor containers and VMs) _______________________ : YES
cgroup-network (associate interfaces to CGROUPS) ___________ : YES
proc (monitor Linux systems) _______________________________ : YES
tc (monitor Linux network QoS) _____________________________ : YES
diskspace (monitor Linux mount points) _____________________ : YES
freebsd (monitor FreeBSD systems) __________________________ : NO
macos (monitor MacOS systems) ______________________________ : NO
statsd (collect custom application metrics) ________________ : YES
timex (check system clock synchronization) _________________ : YES
idlejitter (check system latency and jitter) _______________ : YES
bash (support shell data collection jobs - charts.d) _______ : YES
debugfs (kernel debugging metrics) _________________________ : YES
cups (monitor printers and print jobs) _____________________ : YES
ebpf (monitor system calls) ________________________________ : YES
freeipmi (monitor enterprise server H/W) ___________________ : YES
nfacct (gather netfilter accounting) _______________________ : YES
perf (collect kernel performance events) ___________________ : YES
slabinfo (monitor kernel object caching) ___________________ : YES
Xen ________________________________________________________ : YES
Xen VBD Error Tracking _____________________________________ : NO
Logs Management ____________________________________________ : YES
Exporters:
AWS Kinesis ________________________________________________ : NO
GCP PubSub _________________________________________________ : NO
MongoDB ____________________________________________________ : YES
Prometheus (OpenMetrics) Exporter __________________________ : YES
Prometheus Remote Write ____________________________________ : YES
Graphite ___________________________________________________ : YES
Graphite HTTP / HTTPS ______________________________________ : YES
JSON _______________________________________________________ : YES
JSON HTTP / HTTPS __________________________________________ : YES
OpenTSDB ___________________________________________________ : YES
OpenTSDB HTTP / HTTPS ______________________________________ : YES
All Metrics API ____________________________________________ : YES
Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
Trace All Netdata Allocations (with charts) ________________ : NO
Developer Mode (more runtime checks, slower) _______________ : NO
Additional info
No response
PS PostgreSQL version:
select version();
version
-------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 13.14 (Ubuntu 13.14-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
@cuu508, hey 👋 While go.d.plugin gathers data every update_every
, it collects and calculates bloat statistics less frequently, only every 5 minutes. In between calculations, go.d.plugin relies on the most recently calculated bloat statistics (cached data). This is because the bloat query can be resource-intensive. This 5-minute interval is not configurable at the moment.
In my case, the dashboard was showing the outdated value for hours (and a warning was active too) .
Also, running the bloat calculation query manually did not return the calculated bloat value for that specific index at all (the row was absent). In summary:
- netdata dashboard (both local node and netdata cloud) was showing an outdated value, 72%
- a different bloat calculation query was showing 21%
- netdata's bloat calculation query was omitting that specific index from bloat calculation results
@cuu508 I understood that, thanks! I think I fixed it in #17598
a different bloat calculation query was showing 21%
Netdata uses bloat query from bucardo/check_postgres. I am not a Postgres expert, so I can't say which query is better/more correct.
Ah, sorry, I missed the referenced commit!
I compared bucardo's and netdata's bloat calculation queries. They are identical except the netdata version has at the very end:
WHERE sml.relpages - otta > 10 OR ipages - iotta > 10
I think this is the bit responsible for netdata missing bloat data for some of my indexes. I have one particular index for which bucardo's query calculates ibloat as 0.5:
-[ RECORD 91 ]---+-----------------------------------------------------
db | ***
schemaname | public
tablename | api_ping
tups | 108233496
pages | 1469731
otta | 1637553
tbloat | 0.9
wastedpages | 0
wastedbytes | 0
wastedsize | 0 bytes
iname | api_ping_pkey
itups | 107589248
ipages | 522850
iotta | 1000750
ibloat | 0.5
wastedipages | 0
wastedibytes | 0
wastedisize | 0 bytes
totalwastedbytes | 0
But since ipages is lower than iotta, it is omitted from the netdata version.
The other query calculates the bloat as 43% for this same index. (edit:) but some of the calculated bloat values looked wacky, -247% and such. In the bucardo query at least all ibloat values were zero or positive)