kubernetes-sigs/nfs-ganesha-server-and-external-provisioner

[Proposal] Multi-server storage class and provisioner

elibixby opened this issue · 4 comments

Volume resizing and splitting servers by subpaths cause three problems in my usage:

  • Volume resizing doesn't happen automatically (or at least I haven't been able to figure out how to enable it), so the underlying storage is either overprovisioned or may block PVCs using this storage class. While this isn't technically wrong, it breaks the contract many cloud providers offer with their storage classes (roughly unlimited storage) and reduces the utility of dynamic volume provisioning.
  • Volume resizing is much more difficult (both manually and automatically) if the new size exceeds the underlying storage for PVCs deployed with this storage class, since both the provisioner's volume and the volume the PVC is bound to must be resized.
  • Even if resizing were made automatic, volumes can't be resized down, so the underlying storage would end up dramatically overprovisioned.

One solution might be a different provisioner that provisions NFS servers, and the storage that backs them, according to parameters of the storage class, the PVC that triggered provisioning, or the pod that bound the triggering PVC (if volumeBindingMode: WaitForFirstConsumer).


Mode 1: If the user wants to be able to provision PVCs ahead of time.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-fast
  labels:
    storageClass: 'fast'
provisioner: cluster.local/multi-nfs-provisioner
parameters:
  storageClass: 'fast'
  resources:
    requests:
      cpu: 100m/1Ti
      memory: 250m/1Ti
  topology:
    ...
  affinities:
    ...
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: Immediate

Mode 2: If the user is OK with slower provisioning in exchange for better utilization and keeping the server near its consumers.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-fast-dynamic
  labels:
    storageClass: 'fast'
parameters:
  storageClass: 'fast'
  resources:
    requests:
      cpu: 100m/1Ti
      memory: 250m/1Ti
provisioner: cluster.local/multi-nfs-provisioner
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

When a new volume is requested with storage class nfs-fast, the provisioner would create an NFS server backed by a dynamically provisioned volume of the storage class named in the storageClass parameter, with topology/affinity constraints taken either from the parameters or from the bound pod, and with its size taken from the requesting PVC.
Additionally, resource requests could be specified in terms of the requested storage size, since for most volume types I/O performance scales with size (here, for every terabyte of requested storage the server would request 100m CPU and 250m memory).
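
To make the per-terabyte scaling concrete, here is a minimal sketch of how a provisioner might turn a rate such as 100m/1Ti into a request for a given PVC size. The rate string format is taken from the storage class parameters proposed above; the function names, the supported suffixes, and the round-up rule are assumptions for illustration and not part of any existing provisioner.

```python
from fractions import Fraction

# Binary size suffixes used by Kubernetes quantities (subset, for illustration).
_SIZE_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}


def parse_size(text: str) -> int:
    """Convert a size such as '2Ti' or '512Gi' to bytes."""
    for suffix, factor in _SIZE_SUFFIXES.items():
        if text.endswith(suffix):
            return int(Fraction(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: treat as a plain byte count


def scaled_request(rate: str, pvc_size: str) -> str:
    """Scale a hypothetical 'amount/unit' rate (e.g. '100m/1Ti') to a PVC size.

    Assumption: the amount is written in milli-units ('100m'), as in the
    proposal. The result is rounded up to a whole milli-unit.
    """
    amount, per = rate.split("/")
    if not amount.endswith("m"):
        raise ValueError("this sketch only handles milli-unit amounts like '100m'")
    milli = Fraction(amount[:-1])
    ratio = Fraction(parse_size(pvc_size), parse_size(per))
    scaled = milli * ratio
    # Ceiling division on the exact fraction.
    return f"{-(-scaled.numerator // scaled.denominator)}m"
```

With these assumptions, a 2Ti PVC against cpu: 100m/1Ti yields a 200m CPU request, and a 512Gi PVC against memory: 250m/1Ti yields 125m.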

The StatefulSet for the provisioner, instead of maintaining a mapping from PVs to subpaths, would maintain a mapping from PVs to servers.
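
A rough sketch of that bookkeeping, assuming one dedicated server per PV. Every name here (NfsServer, ServerRegistry, the nfs- and data-nfs- naming scheme) is hypothetical, and a real provisioner would persist this state in Kubernetes objects rather than in memory.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NfsServer:
    """Hypothetical record for one per-volume NFS server."""
    name: str           # name of the server workload
    backing_claim: str  # PVC created on the underlying storage class
    size: str           # size copied from the requesting PVC


class ServerRegistry:
    """Maps each PV name to exactly one dedicated NFS server."""

    def __init__(self) -> None:
        self._servers: Dict[str, NfsServer] = {}

    def provision(self, pv_name: str, size: str) -> NfsServer:
        # Idempotent: a repeated request for the same PV returns the same server.
        if pv_name not in self._servers:
            self._servers[pv_name] = NfsServer(
                name=f"nfs-{pv_name}",
                backing_claim=f"data-nfs-{pv_name}",
                size=size,
            )
        return self._servers[pv_name]

    def release(self, pv_name: str) -> Optional[NfsServer]:
        # Removing the entry stands in for deleting the server and its backing claim.
        return self._servers.pop(pv_name, None)
```

Because each PV owns its own server and backing claim, releasing a PV can return the underlying storage, which a shared subpath layout cannot do.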

Summary of advantages over current approach:

  • Additional storage can be dynamically requested and released from underlying storage by PVCs as expected.
  • Better isolation of PVC requests (PVCs can't spill above the storage of the underlying volume and take away from "guaranteed storage" for other PVCs).
  • NFS server resource requests can be scaled according to the underlying disk size, allowing users of cloud providers where I/O scales with drive size to better saturate their drives.
  • In multi-zone clusters, the server can be colocated with the PVC when binding just-in-time.
  • Easier for users to add a new "kind" of NFS storage based on another storage class (rather than deploying a second helm chart, just create a new parameterized storage class).

I'm relatively new to the CSI ecosystem but I'm wondering if something like this is feasible/has been thought about.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-triage-robot: Closing this issue.
