Prometheus persistent storage settings for PVCs always get deleted. Node disks will be flooded.
jomeier opened this issue · 13 comments
Hi,
we tried to set up persistence for Prometheus (the default is emptyDir) with PVCs and a block storageClass, because that is the recommended setup for OpenShift in production environments.
If we try to add that to the cluster-monitoring-config configmap, the storage (and retention) settings constantly get overwritten by this operator at this code line here:
If we can't set persistent storage with PVCs or set the retention time / retention size, the nodes will fill up and run into stability problems.
Please don't touch the settings in this configmap.
Thanks and greetings,
Josef
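For reference, this is roughly the kind of configuration we are trying to apply, following the standard cluster-monitoring-config format (the storage class name, size and retention values below are just examples):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      # example retention value
      retention: 7d
      volumeClaimTemplate:
        spec:
          storageClassName: managed-premium   # example block storage class
          resources:
            requests:
              storage: 100Gi                  # example size
```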
@jomeier can you elaborate on "will get flooded"?
We have seen a number of issues where, if you set it to block storage, it fails when pods are moved between Azure availability zones, since disk PVs are bound to a single AZ. So it's trading one issue for another.
Hi,
when Prometheus is using emptyDir, it will just write its database to the container's filesystem, which will
a) have an impact on the I/O performance of all other containers running on that node,
b) lose all its data whenever the Prometheus pod or the node is restarted,
c) fill up the disk space of the node.
The OpenShift documentation even mentions this and recommends using local storage for Prometheus because of I/O:
https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-persistent-storage
Since you're speaking about problems with AZs in Azure, would that mean that using Azure block storage is generally not recommended? It's even preconfigured in the ARO cluster. Please clarify.
https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies
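For example, something along these lines should restrict provisioning to a single zone (the class name and zone value are just examples, and older clusters may use the failure-domain.beta.kubernetes.io/zone label instead):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zone-1          # example name
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - westeurope-1                      # example zone
```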
kind regards
Philipp
It's not only Azure. As far as I know, no cloud provider moves block storage across availability zones. This has been a known issue for ages. This is why people run all "block storage applications" in quorums (etcd, Cassandra, Redis, etc.) spanned over 3 availability zones.
Applications relying on a single copy of data in block storage and bound to only one availability zone will go down in case of a region outage (Hacker News has a number of discussions around this; the problem is as old as the cloud itself). All of this is just application architecture. Not much we can do here.
Now, related to the original issue: we noticed that cluster upgrades get stuck when Prometheus is persisted to block storage and the pod is then moved to a node in a different zone during the upgrade. This is why we removed persistence. This is the system Prometheus, and we forward all the data we need to support the service on ingestion, so we can afford data loss (it will happen on each upgrade). If you need Prometheus, you should create your own application workload Prometheus.
Is this need to add a disk coming from performance issues you have already seen, or is it just a precaution?
Overall, Azure Disk performance is very "interesting". I would recommend reading Azure/AKS#1373 (just get some coffee before you start, it's not an easy read). So just slapping a disk onto Prometheus might not help much.
OK, so what is the recommendation about using Azure Disk in ARO in general?
Wouldn't it help to have Prometheus deployed only in one AZ?
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
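e.g. by pinning the pods to one zone with a node affinity rule on the zone label, roughly like this (the zone value is just an example):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - westeurope-1   # example zone
```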
We are not having any performance issues right now; we are just trying to follow best practices.
Same as in any other cloud: use it as long as it fits your architecture :) There is no such general recommendation.
You just need to be aware of the limitations (off the top of my head):
- IOPS count depends on disk size
- The number of disks you can attach depends on VM size
- Disks do not move between AZs
- Block devices are ReadWriteOnce (RWO)
- AzureFile is ReadWriteMany (RWX) but NOT POSIX compliant, so if you spin up a database on AzureFile I bet a pint of good Bavarian 🍺 it will not work.
And overall, if you see performance issues due to emptyDir, please raise a support ticket with details so we can investigate.
Regarding the topology: this would make the cluster a snowflake. You can do this for your own Prometheus, but for us it would mean a snowflake in the cluster fleet, and we are trying to avoid things like this.
It is possible, but the same applies: a region outage will cause it to go down.
okay, thanks for the detailed answer. I think we will just live with this then. It would be nice if this were mentioned in the docs somewhere, though.
The problem we have is that our current OpenShift 3 cluster in Azure is spread across only a single availability zone, so we have to find a solution for migrating those applications to a cluster that uses several AZs.
okay, we just made a test with the StorageClass parameter "zoned":

```yaml
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: 'true'
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```
When this is set for the Azure Disk StorageClass, the Kubernetes scheduler will make sure that a pod is only assigned to a node that belongs to the same zone as the disk.
When there is no node left available in that zone, you will get the following error: `0/9 nodes are available: 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had volume node affinity conflict.`
This seems like a viable solution for us: we could tell Prometheus to use a "zoned" StorageClass, which would prevent its pods from getting stuck.
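In cluster-monitoring-config that would look roughly like this (the class name and size are just examples):

```yaml
prometheusK8s:
  volumeClaimTemplate:
    spec:
      storageClassName: managed-premium-zoned   # example name for the zoned class above
      resources:
        requests:
          storage: 100Gi                         # example size
```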
Now the only thing that prevents us from doing this is the ARO operator, which makes this kind of setup impossible.
For now this will not change. As said, this Prometheus is the system Prometheus managed by the SRE team. We will have to come back to this later.
Could you elaborate on what "managed by SRE team" means? Is there someone actively looking into our cluster's health?
Step back: you are talking about Azure Red Hat OpenShift, the managed OpenShift which you created using the `az aro create` command?
If so, yes, there is a whole team dedicated to supporting those clusters.
Yes, we are talking about ARO clusters. On-premises we don't have that problem, because there is no ARO operator that wipes our cluster-monitoring-config configmap.
So don't you think we will run into problems when we have Prometheus running on emptyDir? On our on-premises cluster, Prometheus collects about 21 GB of data per day.
At that rate (roughly 630 GB in 30 days), our ARO Prometheus would fill up the node in less than 30 days.
You should not. If this happens and you suspect this to be the cause, raise a support ticket so we can look into it.