Collect from the current node only the information about that node's VMs/CTs/etc.
maelstrom256 opened this issue · 10 comments
Currently, the exporter collects the full set of cluster-wide metrics from every cluster node.
If you have twelve nodes in one cluster, you get a dozen identical copies of the data.
That is extreme duplication and a waste of resources.
There are multiple strategies to mitigate this problem. Off the top of my head:
- Setup a round robin DNS entry for scraping. I.e. instead of configuring 12 targets in your prometheus config, only specify the round robin DNS record containing all the pve nodes you want to scrape.
- Only collect metrics from a subset of those nodes. For best results select nodes which are never down at the same time.
In the second case you might want to relabel the collected metrics after scraping and replace the instance label with something which does not change between machines.
The simplest relabel config is the following;
metric_relabel_configs:
  - action: replace
    target_label: instance
    replacement: mycluster.example.com
  - action: labeldrop
    regex: exported_instance
It will force instance=mycluster.example.com and drop exported_instance completely.
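For context, here is a minimal sketch of how that relabel config could sit inside a scrape job; the job name, the node hostnames and the exporter port 9221 are placeholders for whatever your setup uses:

scrape_configs:
  - job_name: pve                          # placeholder job name
    static_configs:
      - targets:
          - pve1.example.com:9221          # placeholder: a subset of nodes running the exporter
          - pve2.example.com:9221
    metric_relabel_configs:
      - action: replace
        target_label: instance
        replacement: mycluster.example.com
      - action: labeldrop
        regex: exported_instance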
Thank you for the recommendations.
But round-robin will lead to hanging requests if one node becomes unavailable. So losing one node means missing any information about what's going on, and right at the time when that information is vital.
Am I wrong?
And for the second variant, the node parameter is hidden inside structures; pve-exporter itself returns just a bunch of data that will be labelled as an atomic unit…
I can't see how that would work.
But round-robin will lead to hanging requests if one node becomes unavailable
It is certainly possible to configure two or more DNS round robin records and then distribute the proxmox IPs into those sets. Then scrape multiple DNS round robin records.
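A rough sketch of what that could look like on the Prometheus side, assuming two hypothetical round-robin records pve-a.example.com and pve-b.example.com that each resolve to a different subset of nodes, and the exporter on its default port 9221:

scrape_configs:
  - job_name: pve
    dns_sd_configs:
      - names:
          - pve-a.example.com    # hypothetical record for one subset of nodes
          - pve-b.example.com    # hypothetical record for the other subset
        type: A
        port: 9221
    metric_relabel_configs:
      - action: replace
        target_label: instance
        replacement: mycluster.example.com
      - action: labeldrop
        regex: exported_instance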
just a bunch of data that will be labelled as an atomic unit
Relabelling is a constant source of confusion for prometheus users. I suggest starting with Life of a Label and then following as many posts in the robust perception blog as necessary...
The ultimate goal of the metric_relabel_configs I've posted above is that any node can be scraped, but the resulting metrics are identical with respect to the metric labels.
OK, after a couple of tries I've realized that this task cannot be done with any amount of label manipulation.
pve_up{id="lxc/1077"} 1.0
pve_disk_size_bytes{id="lxc/1077"} 1.073741824e+011
pve_disk_usage_bytes{id="lxc/1077"} 7.857635328e+09
pve_memory_size_bytes{id="lxc/1077"} 3.4359738368e+010
pve_memory_usage_bytes{id="lxc/1077"} 4.165922816e+09
…et cetera…
are missing the node label, and
pve_disk_size_bytes{id="node/pve1"} 3.33615857664e+011
pve_disk_size_bytes{id="storage/pve1/local-zfs"} 4.40453726208e+011
pve_memory_size_bytes{id="node/pve1"} 9.986310144e+010
pve_memory_usage_bytes{id="node/pve1"} 7.409995776e+010
…et cetera…
have the node name melted into the id label together with other values, or maybe that's just a coincidence and it's not really the node anyway.
But,
pve_storage_info{id="storage/pve1/local-zfs",node="pve1",storage="local-zfs"} 1.0
pve_node_info{id="node/pve1",level="",name="pve1",nodeid="1"} 1.0
pve_onboot_status{id="lxc/1077",node="pve1",type="lxc"} 1.0
do include a node label (though it is called name in the second line).
Thus, the exporter simply does not provide a complete set of labels to work with, and relabelling will not help.
Also, the exported data lack the SMART and physical disk info that Proxmox provides.
On top of all that, the cluster dashboard shows something closer to the quantum temperature of the last scattering surface than the CPU load for nodes (not for CTs/VMs), but that is not the exporter's problem anyway.
Possibly I could fix this, but it would require rewriting the whole pipeline.
Would you like to know on which proxmox node a given container/VM is running?
Maybe you need joins (see this blog)? I suggest the following query to try and see whether this is what you need:
pve_cpu_usage_ratio * on (id) group_left(node) pve_guest_info
Would you like to know on which proxmox node a given container/VM is running?
Before anything else, I want to separate one node's data from the others: VMs, CTs, storage.
It would help greatly if every metric were labelled with its appropriate node.
Also, it would be very nice to separate VM/CT storage from node storage; for now they are mixed in pve_disk_size_bytes, and there is no size and usage in pve_storage*.
Maybe you need joins
Without separating the data by node you cannot properly calculate CPU/memory/disk/net/any other load per node, because you cannot sum the load by node when there is no node information on the CT/VM metrics, can you?
Yes, you can.
CPU load per node:
pve_cpu_usage_ratio * on (id) group_left(name) pve_node_info
CPU load per VM/CT with name and node labels added from pve_guest_info:
pve_cpu_usage_ratio * on (id) group_left(name, node) pve_guest_info
Disk usage per node with the node label added from pve_storage_info:
pve_disk_usage_bytes * on(id) group_left(node) pve_storage_info
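To illustrate the summing point, here is a sketch of a recording rule built on the join above; the rule and group names are arbitrary, and it assumes pve_guest_info carries the node label as shown earlier in this thread:

groups:
  - name: pve-per-node                     # arbitrary group name
    rules:
      # Total CPU usage of all guests, aggregated per Proxmox node.
      - record: node:pve_guest_cpu_usage_ratio:sum
        expr: >
          sum by (node) (
            pve_cpu_usage_ratio * on (id) group_left(node) pve_guest_info
          )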
Keep reading the robust perception blog...
@maelstrom256 @znerol
Little late to this party as I was just getting started when I saw this question. It got me thinking about using HAProxy to front-end this. What I decided to do was run my 3 Proxmox servers in a round-robin backend using HAProxy. I created a 4th DNS record and an additional IP to front-end the 3 servers, and pointed Prometheus at it instead of my 3 Proxmox servers.
So far it's working well. Here's a sample configuration of the HAProxy frontend and backend.
frontend proxmox-prometheus
    bind 10.5.17.9:8006 name 10.5.17.9:8006
    mode tcp
    log global
    timeout client 30000
    tcp-request inspect-delay 5s
    acl proxmox-prometheus req.ssl_sni -i server-proxmox-0.domain.com
    tcp-request content accept if { req.ssl_hello_type 1 }
    use_backend proxmox-prometheus-back_ipvANY if proxmox-prometheus

backend proxmox-prometheus-back_ipvANY
    mode tcp
    id 127
    log global
    balance roundrobin
    timeout connect 30000
    timeout server 30000
    retries 3
    option httpchk GET /
    server server-proxmox-1 10.2.3.51:8006 id 104 check-ssl check inter 1000 weight 10 verify none
    server server-proxmox-2 10.2.3.52:8006 id 114 check-ssl check inter 1000 weight 10 verify none
    server server-proxmox-3 10.2.3.53:8006 id 115 check-ssl check inter 1000 weight 10 verify none
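If it helps anyone else wiring this up, here is a sketch of the matching Prometheus side, assuming the multi-target setup from the prometheus-pve-exporter README; the exporter address 127.0.0.1:9221 is a placeholder, and the target is the frontend DNS name from the HAProxy config above:

scrape_configs:
  - job_name: pve
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets:
          - server-proxmox-0.domain.com    # the HAProxy frontend load-balancing the PVE API
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9221        # placeholder: wherever pve-exporter is running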