kubernetes-retired/external-storage

OpenEBS node-disk-manager

jsafrane opened this issue · 22 comments

What's our opinion about OpenEBS node disk manager (NDM)?

https://github.com/openebs/node-disk-manager
openebs/node-disk-manager#1
https://docs.google.com/presentation/d/1XcCWQL_WfhGzNjIlnL1b0kpiCvqKaUtEh9XXU2gypn4/

We could probably save some effort on both sides if we cooperate. For example, NDM's StoragePool idea looks like our LVM-based dynamic provisioner. And I personally like the automated discovery of local disks, which I'd need in order to deploy Gluster or Ceph on top of local PVs.

To me it seems that NDM is trying to solve a similar use case to ours; it's just more focused on the installation and discovery of the devices to consume as PVs, while Kubernetes focuses on the runtime aspects of how to use the local devices (i.e. schedule and run pods). IMO, it would make sense to merge NDM with our local provisioner, or at least make the integration as easy as possible for both sides.

/area local-volume
@ianchakeres @msau42 @davidz627 @cofyc @dhirajh @humblec ?
(did I forget anyone?)

@jsafrane I have a very good opinion of the node disk manager effort from OpenEBS. That's one of the reasons I got involved in some discussions around it with the OpenEBS folks, hence the mentions of the Gluster operator and such in the design proposal. In reality, node disk discovery and handling of these components is a common problem, and different vendors solve it in different ways. A good common solution is a must, considering how important storage handling is in an orchestrator like Kube; it also helps avoid many vendor-specific efforts with limitations here and there. Our local storage provisioner is a (good) start, but the OpenEBS proposal has some good thoughts which could be adopted or sharpened with a community contribution and possibly merged with the local volume provisioner. In short, I am all for it and eagerly waiting for the next plan of action.

@kmova @umamukkara @epowell
Additional reference:
https://blog.openebs.io/achieving-native-hyper-convergence-in-kubernetes-cb93e0bcf5d3

kubernetes/kubernetes#58569

I took a quick glance and it looks promising as an end-to-end disk management solution for distributed storage applications. The metrics and health monitoring aspects look very useful, and it should solve the issue of managing disks for DaemonSet-based providers.

I'm trying to think about how this could be integrated with the local-pv-provisioner from two angles:

  • simplify the disk discovery aspect for static PV provisioning
    • the ongoing design for adding fs formatting and mounting into the local volume plugin will help with this too by eliminating the need to preformat and mount local block devices
  • automate more of the LVM provisioner by implementing an LVM StoragePool plugin

For both use cases, I think there are still some challenges around the categorization of disks into StoragePools that would need to be ironed out. IIUC, NDM creates a Disk object for every block device in the system, so it would be up to the StoragePool implementation to further filter which Disks to use. And an implementation MUST do that filtering, otherwise it could end up stepping on the root filesystem or on devices owned by other K8s volume plugins.

Filtering using Disk.Vendor + Disk.Model may be sufficient if you want all similar disks to be in the same StoragePool. The challenges I see are about how to support more advanced disk configurations:

  • You want to divide disks of the same type across multiple StoragePools
  • Some environments may not want to use whole disks directly, and instead partition them or put them under RAID.

The local PV provisioner didn't solve this and instead required users to prep and categorize the disks beforehand. While I can see NDM simplifying some of the simpler use cases, I'm not sure what the best way is to solve the more advanced ones.
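To make the filtering point concrete, a StoragePool implementation could declare up front which Disks it may consume. The manifest below is purely illustrative; the kind, API group/version, and field names are assumptions, not an actual NDM schema:

apiVersion: openebs.io/v1alpha1   # assumed group/version
kind: StoragePool
metadata:
  name: fast-ssd-pool
spec:
  # Only Disks matching vendor/model are consumed; everything else
  # (the root disk, devices owned by other volume plugins) is left untouched.
  diskSelector:
    vendor: "ExampleVendor"
    model: "SSD-Model-X"
  nodeSelector:
    kubernetes.io/hostname: node-1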

kmova commented

@msau42 @jsafrane Thanks for the review and inputs!

@humblec and I have been discussing how to keep the disk inventory and storage pool implementation generic so it can be used in multiple scenarios. We have made some progress on the following (will shortly update the design PR):

  • ndm creates a Disk object for every block device in the system.
  • ndm can track whether a block device has moved from one node to another within the cluster.
  • ndm can help with partitioning (and expose the partitioned disks as block devices)

We definitely need more help/feedback on advanced use cases and API design.

@kmova I'm wondering if it would be simpler to use PVs as your disk inventory instead of a new Disk CRD object. The advantage is that you can reuse the existing PVC/PV implementation to handle dynamic provisioning and attaching of volumes to nodes.
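For illustration, a single raw block device published as a local PV and pinned to its node might look like this (device path, capacity, class, and node name are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 960Gi
  volumeMode: Block              # expose the device as-is, no filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /dev/sdb               # placeholder device path
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1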

@kmova @msau42 I am trying to understand the slide titled "Complementing Local PV". Currently the local provisioner crawls through the discovery directory to find volumes to create PVs for. It appears that with NDM one could add another form of discovery, where NDM uses its own discovery mechanism to create local PVs. This seems like a useful enhancement to me, assuming it's adding another mechanism and not replacing the existing one.
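For context, the existing discovery mechanism is driven by the provisioner's ConfigMap, which maps a storage class to a discovery directory on each node, roughly like this (directory and class name are just examples):

apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
data:
  storageClassMap: |
    local-storage:
      hostDir: /mnt/disks    # directory scanned on the host
      mountDir: /mnt/disks   # where that directory appears inside the provisioner pod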

On the question of using Disk CRs, I would like to better understand what information they actually store. I assume that to support operations like unplugging and moving disks, the Disk CR stores more information than one would put in a local PV. Its lifecycle might also be a bit different from a PV's as a result. If that is the case, then keeping the Disk CR might make sense. Again, I need to understand what the information in the CR is and how it is used.

kmova commented

@msau42 - Using PVs in place of a new Disk CR, I was running into the following challenges:

  • In some use cases, we would want to replace a disk without having to restart the pod. In these cases, the pods can run with /dev mounted and take a configuration (say, a configmap) with the list of disks they can operate on (kubernetes/kubernetes#58569); a minimal sketch of such a configmap follows this list.
  • Representing the partitioned-disk hierarchy (this probably could still be addressed by ensuring that the parent disk (PV) is stopped from being assigned to a PVC).
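As referenced above, a minimal sketch of such a configmap, assuming the pod mounts /dev itself; the object name and data key are made up:

apiVersion: v1
kind: ConfigMap
metadata:
  name: storage-pod-disk-list   # hypothetical name
data:
  # Device paths the pod may operate on; updating this list replaces a disk
  # without restarting the pod, since /dev is already mounted.
  disks: |
    /dev/sdb
    /dev/sdc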

Another consideration was the usability/operations perspective: for example, management tools around Kubernetes like Weave Scope could represent these disks as visual elements, with the ability to blink LEDs, get iostats, etc.

kmova commented

@dhirajh This disk CR can store details like:

  • capacity
  • serial number
  • model
  • vendor
  • physical location (i.e. enclosure slot number)
  • rotational speed (if hard disk)
  • sector size
  • write cache
  • firmware revision level
  • extended log pages

In addition, as part of dynamic attributes or monitored metrics (a rough sketch of a Disk CR combining these fields follows this list):

  • state (online, removed, etc.)
  • status (normal, faulted)
  • temperature (when applicable)
  • SMART errors
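Put together, a Disk CR carrying these attributes could look roughly like the sketch below; the field names and layout are illustrative, not the actual NDM schema:

apiVersion: openebs.io/v1alpha1   # assumed group/version
kind: Disk
metadata:
  name: disk-3d8b1efad969208d6bf5971f2c34dd0c
spec:
  capacity:
    storage: 4Ti
  details:
    vendor: ExampleVendor
    model: HDD-Model-X
    serial: WX12345678
    firmwareRevision: "01.0"
    rotationRate: 7200          # rpm; 0 for SSDs
    logicalSectorSize: 512
    writeCache: enabled
    slot: "enclosure-0/slot-7"  # physical location
status:
  state: Active                 # e.g. Active / Removed
  status: Normal                # e.g. Normal / Faulted
  temperature: 34               # degrees Celsius, when reported
  smartErrors: 0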

@kmova I think it's still possible to use PVs for inventory management. You don't necessarily have to mount the PVC directly in the spec. If supporting all kinds of volume types, such as cloud block storage, is on your future roadmap, then the PVC abstraction could also provide dynamic provisioning and disk attachment capabilities.

kmova commented

@msau42 - IIUC the PVs can be created by NDM, and the additional disk attributes could be added under annotations (or maybe under an extended spec?). To cover the case in kubernetes/kubernetes#58569, the pod can still mount /dev, and the configuration can specify the PV objects it can use, which will have the path information.

I like the idea of using the PVC abstractions for dynamic provisioning. How do we get the PVs attached to the node without adding them to a Deployment/App spec?

Getting the PVs attached to the node is the hard part because it is tied to Pod scheduling. Having a Pod per PV is probably not going to scale, and you have to handle cases like the pod getting evicted. I'm not sure if leveraging the VolumeAttachment object would work; it may conflict with or confuse the Attach/Detach controller.

@msau42 I do think expanding the VolumeAttachment object for local storage/disk handling could complicate things. It's better to have it on another/new API object or a custom CRD like NDM currently has. If a custom CRD for a disk object is not optimal, we could think about a new API object for disk/local storage handling, IMO.

@kmova I feel we should also have a node mapping in the Disk CRD. That will help us a lot when considering scheduling, backtracking, or other decision-making based on this object.

kmova commented

@humblec - yes we can get the topology labels from the node where the disks are discovered and attach them to the Disk objects.

Example:

kind: Disk
metadata:
  name: disk-3d8b1efad969208d6bf5971f2c34dd0c
  labels:
    "kubernetes.io/hostname": "gke-openebs-user-default-pool-044afcb8-bmc0"

In addition, based on feedback, I have included the ability to fetch additional information describing how disks are attached - via internal bus, HBA, SAS expanders, etc. This information can be used when provisioning latency-sensitive pools.

Agreed, I don't see a great way to handle attached disk types without always forcing some Pod to be on the node. I think the Disk CRD could work fine if you only plan on supporting local disks. But since other volume types were mentioned in the roadmap, I was trying to envision how things like provisioning and attaching could be supported without reimplementing volume plugins and much of the Kubernetes volume subsystem.

cc @travisn @bassam for rook and @jcsp for ceph

I'd like to see some convergence on disk object schemas.

As an alternative data point, I spoke a bit with @dhirajh about how they deploy Ceph in their datacenter. He mentioned that they use StatefulSets, and each replica (OSD) manages just one local PV. All replicas use the same class and capacity of disk, and instead there is a higher-level operator that manages multiple StatefulSets and balances them across fault domains (i.e. racks). This operator is in charge of making sure that capacity is equal across fault domains, and it can scale up each StatefulSet when more Ceph capacity is requested. With this architecture, they don't need their Ceph pods to manage multiple disks, and a disk failure is contained to a single replica, so they can use PVCs directly. For cases where nodes have different numbers of disks and capacities, the operator can create more StatefulSets to use them.
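A rough sketch of that pattern, assuming the local PVs are exposed through a local-storage StorageClass with delayed binding; the names, image, and sizes are placeholders:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ceph-osd-rack-a          # one StatefulSet per disk class / fault domain
spec:
  serviceName: ceph-osd-rack-a
  replicas: 3                    # one OSD (and one local PV) per replica
  selector:
    matchLabels:
      app: ceph-osd
      rack: a
  template:
    metadata:
      labels:
        app: ceph-osd
        rack: a
    spec:
      containers:
      - name: osd
        image: example/ceph-osd:latest   # placeholder image
        volumeDevices:                   # consume the claim as a raw block device
        - name: data
          devicePath: /dev/osd-disk
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      volumeMode: Block
      storageClassName: local-storage
      resources:
        requests:
          storage: 960Gi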

kmova commented

Thanks @msau42, that's a good data point. I will add it to the design document. Along with this, I will also gather additional details on use cases where the storage pods need multiple PVs, and on the expected behaviour when using SPDK to access disks.

@msau42 @kmova IMO there are a good number of use cases where a storage pod needs more than one local PV. For example, sometimes the storage pod has to keep its own metadata on one PV and use other PVs for data volumes or for serving volume create requests. From another angle, one local PV may not be sufficient to serve all the PVC requests coming from the Kube user. At least in Gluster we support around 1000 volumes from a 3-node Gluster cluster; just attaching one disk and carving space out of it may not be sufficient.

In Gluster's case, it would also be fairly heavyweight to have one GlusterFS pod per device on a node. Never mind that it would also limit per-node scale-out expansion, which is one of the core features of Gluster.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.
