DevOps-Nirvana/Kubernetes-Volume-Autoscaler

Autoscaler attempts to resize below the current size, and PVC sizes are not human-readable.

GuillaumeOuint opened this issue · 9 comments

Sometimes the autoscaler tries to resize a PVC to a size below its current size, which raises an error:

Volume infra.data-nfs-server-provisioner-1637948923-0 is 85% in-use of the 80Gi available
  BECAUSE it is above 80% used
  ALERT has been for 1306 period(s) which needs to at least 5 period(s) to scale
  AND we need to scale it immediately, it has never been scaled previously
  RESIZING disk from 86G to 20G
  Exception raised while trying to scale up PVC infra.data-nfs-server-provisioner-1637948923-0 to 20000000000 ...
(422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e69b53c3-d332-4925-b9ea-afa7570297a9', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'b64e47c9-2a4e-48ae-83bc-355685b6c007', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e5841496-62d0-426a-a987-4b26ec143a20', 'Date': 'Sat, 22 Oct 2022 16:58:07 GMT', 'Content-Length': '520'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"PersistentVolumeClaim \"data-nfs-server-provisioner-1637948923-0\" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value","reason":"Invalid","details":{"name":"data-nfs-server-provisioner-1637948923-0","kind":"PersistentVolumeClaim","causes":[{"reason":"FieldValueForbidden","message":"Forbidden: field can not be less than previous value","field":"spec.resources.requests.storage"}]},"code":422}


FAILED requesting to scale up `infra.data-nfs-server-provisioner-1637948923-0` by `10%` from `86G` to `20G`, it was using more than `80%` disk space over the last `78360 seconds`

I'm using Helm chart version 1.0.3 (same image tag).

Another issue: the autoscaler resized another PVC from 13Gi to 14173392076, which is not human-readable like the original value. It's not a serious problem, but it is confusing. The autoscaler also sent the Slack alert for this PVC twice, several hours apart.

@GuillaumeOuint Can you enable verbose mode? It'll print out a LOT more data, and hopefully you can post the verbose dump of the failure to a gist or pastebin. If you're worried about further issues, please enable dry-run mode as well. To enable those, either set the values in your Helm values file or use the CLI args when installing the Helm chart. See: https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/#installation-with-helm
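For example, in your Helm values file it would be something along these lines (double-check the exact key names against the chart's values.yaml; the same pair can also be passed as --set flags at install/upgrade time):

# Print much more detail per volume, and skip any actual resize calls while debugging.
verbose: "true"
dry_run: "true"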

Also, could you please provide all the other custom (non-default) values you've set in this helm chart? Thanks!

@AndrewFarley I've manually resized the PVCs, so I'll have to wait a few days for a new trigger.
I've enabled verbose and dry-run mode, and I'll leave this issue open and report back with further details.

For the Helm chart config, I've only changed the Slack webhook, channel, and Prometheus URL, plus the following overrides (human-readable equivalents noted below the list):

scale_above_percent: "80"
scale_after_intervals: "5"
scale_cooldown_time: "22200"
scale_up_max_increment: "2000000000"
scale_up_max_size: "20000000000"
scale_up_min_increment: "1000000000"
scale_up_percent: "10"
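
In human-readable terms (assuming the SI G/T notation the autoscaler's own logs use), the byte and second values above work out to roughly:

scale_cooldown_time: "22200"          # seconds, about 6 h 10 m
scale_up_max_increment: "2000000000"  # bytes, about 2 GB
scale_up_max_size: "20000000000"      # bytes, about 20 GB
scale_up_min_increment: "1000000000"  # bytes, about 1 GB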

@GuillaumeOuint Alright, yeah, all your settings look good, nothing funny. I'm eager to hear what verbose mode shows. Even now, can you copy/paste what it shows for the infra.data-nfs-server-provisioner-1637948923-0 object? It should look something like this...

Volume infrastructure.prometheus-server is 62% in-use of the 10Gi available
  VERBOSE DETAILS:
    name: prometheus-server
    volume_size_spec: 10Gi
    volume_size_spec_bytes: 10737418240
    volume_size_status: 10Gi
    volume_size_status_bytes: 10737418240
    namespace: infrastructure
    storage_class: gp3-retain
    resource_version: 8498015
    uid: bfc7827c-56ac-4bdf-b79e-06090e400294
    last_resized_at: 0
    scale_above_percent: 80
    scale_after_intervals: 5
    scale_up_percent: 20
    scale_up_min_increment: 1000000000
    scale_up_max_increment: 16000000000000
    scale_up_max_size: 16000000000000
    scale_cooldown_time: 22200
    ignore: False
 and is not above 80%

Those raw values above are what I need. Thanks!

Also, which provisioner are you using, @GuillaumeOuint? Please grab the YAML of the StorageClass you're using (e.g. kubectl get storageclass <name> -o yaml) and paste it here. Thanks!

Hi @AndrewFarley, here is the verbose output for infra.data-nfs-server-provisioner-1637948923-0:

Volume infra.data-nfs-server-provisioner-1637948923-0 is 80% in-use of the 85Gi available
  VERBOSE DETAILS:
    name: data-nfs-server-provisioner-1637948923-0
    volume_size_spec: 85Gi
    volume_size_spec_bytes: 91268055040
    volume_size_status: 85Gi
    volume_size_status_bytes: 91268055040
    namespace: infra
    storage_class: scw-bssd
    resource_version: 7847312419
    uid: da592cfc-e8bc-4ebe-9715-f00c6d838020
    last_resized_at: 0
    scale_above_percent: 80
    scale_after_intervals: 5
    scale_up_percent: 10
    scale_up_min_increment: 1000000000
    scale_up_max_increment: 2000000000
    scale_up_max_size: 20000000000
    scale_cooldown_time: 22200
    ignore: False
  BECAUSE it is above 80% used
  ALERT has been for 844 period(s) which needs to at least 5 period(s) to scale
  AND we need to scale it immediately, it has never been scaled previously
  DRY RUN was set, but we would have resized this disk from 91G to 20G

I use Scaleway as my provider; here is the storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scw-bssd
  labels:
    k8s.scw.cloud/object: StorageClass
    k8s.scw.cloud/system: csi
provisioner: csi.scaleway.com
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

@GuillaumeOuint I see the flaw. The problem is that you've set scale_up_max_size: "20000000000", which is 20GB, less than the current size of this disk. The computed target gets capped at that 20G maximum, which is below the existing ~91G (85Gi), and Kubernetes refuses to shrink a PVC, hence the 422. Please leave this bug open; I will make this service handle this edge case gracefully by printing a warning that there is likely an error in your configuration, since it can never scale a volume down.

To fix: remove your override of scale_up_max_size, OR adjust it to the maximum size you'd like this volume to grow to, which must be above the current size of your NFS volume. E.g. set it to "200000000000" (one more zero than what you have above, i.e. 200GB).
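
For example, a corrected values snippet could look like the following (same snake_case key as your overrides above; the exact ceiling is your choice, as long as it exceeds the volume's current size):

# Raise the absolute size cap well above the current ~91GB (85Gi) volume,
# or delete this override entirely to fall back to the chart's default.
scale_up_max_size: "200000000000"   # 200GB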

Cheers!

@AndrewFarley Oh snap, I didn't realize I'd made that typo on my side, sorry.
If I can suggest one thing: make the values more human-readable, as this would prevent issues like this and make the config and logs easier to understand. I think Kubernetes handles these human-readable values (e.g. 10Gi) well, but the autoscaler tends to use and write everything in raw bytes.

@GuillaumeOuint Sounds like a great feature, I'll make them human readable. Thanks for the idea!

About to release 1.0.5, with verbose details that show exactly what the values are in the logs; it would also have caught the error you ran into.

  VERBOSE DETAILS:
-------------------------------------------------------------------------------------------------------------
                        name: prometheus-server
            volume_size_spec: 10Gi
      volume_size_spec_bytes: 10737418240 (11G)
          volume_size_status: 10Gi
    volume_size_status_bytes: 10737418240 (11G)
                   namespace: infrastructure
               storage_class: gp3-retain
            resource_version: 8498015
                         uid: bfc7827c-56ac-4bdf-b79e-06090e400294
             last_resized_at: 0 (1970-01-01 00:00:00 UTC +0000)
         scale_above_percent: 80%
       scale_after_intervals: 5
            scale_up_percent: 20%
      scale_up_min_increment: 1000000000 (1G)
      scale_up_max_increment: 16000000000000 (16T)
           scale_up_max_size: 16000000000000 (16T)
         scale_cooldown_time: 22200 (06:10:00)
                      ignore: False
-------------------------------------------------------------------------------------------------------------