- About
- Design
- Prerequisites
- Supported Kubernetes versions
- Setup
- Automated testing
- Communication and contribution
This Kubernetes plugin is a Persistent Memory Container Storage Interface (PMEM-CSI) driver for provisioning of node-local non-volatile memory to Kubernetes as block devices. The driver can currently utilize non-volatile memory devices that can be controlled via the libndctl utility library. In this readme, we use persistent memory to refer to a non-volatile dual in-line memory module (NVDIMM).
The PMEM-CSI driver follows the CSI specification by listening for API requests and provisioning volumes accordingly.
The PMEM-CSI driver can operate in two different DeviceModes: LVM and Direct.
The following diagram illustrates the operation in DeviceMode:LVM:
In DeviceMode:LVM, the PMEM-CSI driver uses LVM to manage logical volumes, which avoids the risk of fragmentation. The LVM logical volumes are served to satisfy API requests. One Volume Group is created per Region, which ensures the region-affinity of served volumes.
The driver consists of three separate binaries that form two initialization stages and a third API-serving stage.
During startup, the driver scans persistent memory for regions and namespaces, and tries to create more namespaces using all or part (selectable via option) of the remaining available space. The namespace size can be specified as a driver parameter and defaults to 32 GB. This first stage is performed by a separate binary, `pmem-ns-init`.
The second stage of initialization arranges the physical volumes provided by namespaces into LVM volume groups. This is performed by a separate binary, `pmem-vgm`.
After the two initialization stages, the third binary, `pmem-csi-driver`, starts serving CSI API requests.
The PMEM-CSI driver can pre-create Namespaces in two modes, forming corresponding LVM Volume Groups, to serve volumes based on `fsdax` or `sector` (alias `safe`) mode Namespaces. The amount of space to be used is determined by two options, `-useforfsdax` and `-useforsector`, given to `pmem-ns-init`. Each option specifies an integer limit as a percentage, which is applied separately in each Region. The default values are `useforfsdax=100` and `useforsector=0`. A CSI volume request can specify the Namespace mode using the driver-specific argument `nsmode`, which has a value of either "fsdax" (default) or "sector". A volume provisioned in `fsdax` mode will have the `dax` option added to its mount options.
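The percentage limits can be illustrated with a small Go sketch. It is not the `pmem-ns-init` implementation: the function names, the 128 GiB region size, and the reading of "32 GB" as 32 GiB are assumptions made for illustration, as is the choice to compute the limit against the total size of the Region.

```go
package main

import "fmt"

// defaultNamespaceSize mirrors the 32 GB default mentioned in the text.
// Assumption: "GB" is treated as GiB in this sketch.
const defaultNamespaceSize uint64 = 32 << 30

// regionBudget returns how many bytes of one Region may be used for a
// Namespace mode, given a percentage limit in the range 0..100.
// Assumption: the limit is applied to the total Region size.
func regionBudget(regionSize, percent uint64) (uint64, error) {
	if percent > 100 {
		return 0, fmt.Errorf("percentage %d out of range 0..100", percent)
	}
	return regionSize * percent / 100, nil
}

// namespaceCount returns how many Namespaces of nsSize fit into budget.
func namespaceCount(budget, nsSize uint64) uint64 {
	if nsSize == 0 {
		return 0
	}
	return budget / nsSize
}

func main() {
	const regionSize uint64 = 128 << 30 // hypothetical 128 GiB Region

	options := []struct {
		name    string
		percent uint64
	}{
		{"useforfsdax", 50},
		{"useforsector", 25},
	}
	for _, opt := range options {
		budget, err := regionBudget(regionSize, opt.percent)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s=%d: budget %d GiB, %d namespace(s)\n",
			opt.name, opt.percent, budget>>30,
			namespaceCount(budget, defaultNamespaceSize))
	}
}
```

With the default values (`useforfsdax=100`, `useforsector=0`), the same calculation gives the whole Region to `fsdax` Namespaces and nothing to `sector` Namespaces.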
The PMEM-CSI driver can leave space on devices for others, and recognize "own" namespaces. Leaving space for others can be achieved by specifying lower-than-100 values for the `-useforfsdax` and/or `-useforsector` options. The distinction "own" vs. "foreign" is implemented by setting the Name field in Namespace to a static string "pmem-csi" during Namespace creation. When adding Physical Volumes to Volume Groups, only Physical Volumes that are based on Namespaces with the name "pmem-csi" are considered.
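The "own" vs. "foreign" distinction amounts to a filter on the Name field. The following Go sketch shows the idea only; the `Namespace` struct, its fields, and the device names are hypothetical and do not reflect the driver's real data types.

```go
package main

import "fmt"

// ownName is the static string that marks Namespaces created by the driver.
const ownName = "pmem-csi"

// Namespace is a hypothetical, simplified view of a PMEM Namespace.
type Namespace struct {
	Dev  string // device name, for example "namespace0.0" (illustrative)
	Name string // Name field stored in the Namespace
	Mode string // "fsdax" or "sector"
}

// ownNamespaces keeps only the Namespaces whose Name field equals ownName,
// so that foreign Namespaces are never added to a Volume Group.
func ownNamespaces(all []Namespace) []Namespace {
	var own []Namespace
	for _, ns := range all {
		if ns.Name == ownName {
			own = append(own, ns)
		}
	}
	return own
}

func main() {
	all := []Namespace{
		{Dev: "namespace0.0", Name: "pmem-csi", Mode: "fsdax"},
		{Dev: "namespace0.1", Name: "other-app", Mode: "fsdax"},
		{Dev: "namespace1.0", Name: "pmem-csi", Mode: "sector"},
	}
	for _, ns := range ownNamespaces(all) {
		fmt.Printf("%s (%s) is eligible as a Physical Volume\n", ns.Dev, ns.Mode)
	}
}
```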
The following diagram illustrates the operation in DeviceMode:Direct:
In DeviceMode:Direct, the PMEM-CSI driver allocates Namespaces directly from the storage device. This creates a risk of device space fragmentation, but reduces complexity and run-time overhead by avoiding an additional device mapping layer. Direct mode also ensures the region-affinity of served volumes, because a provisioned volume can belong to one Region only.
In Direct mode, the two preparation stages used in LVM mode are not needed.
The PMEM-CSI driver creates a Namespace directly in the mode requested by the volume creation request, thus bypassing the complexity of the pre-allocated pools that are used in DeviceMode:LVM.
In DeviceMode:Direct, the driver does not attempt to limit space use. It also does not mark "own" namespaces. The Name field of a Namespace is set to the VolumeID.
The PMEM-CSI driver supports running in different modes, selected by passing one of the values below to the driver's `-mode` command line option. In each mode, it starts a different set of open source Remote Procedure Call (gRPC) servers on the given driver endpoint(s).
- Controller mode is intended to be used in a multi-node cluster and should run as a single instance at cluster level. When the driver is running in Controller mode, it forwards the pmem volume create/delete requests to the registered node controller servers running on the worker nodes. In this mode, the driver starts the following gRPC servers: the Identity server, the Registry server, and the Master controller server.
- Node mode is intended to be used in a multi-node cluster by worker nodes that have persistent memory devices installed. When the driver starts in this mode, it registers with the Controller driver running on a given `-registryEndpoint`. In this mode, the driver starts the following servers: the Identity server, the Node controller server, and the Node server.
- Unified mode is intended to run the driver on a single host, mostly for testing the driver in a non-clustered environment.
The Identity server operates on a given endpoint in all driver modes and implements the CSI Identity interface.
When the PMEM-CSI driver runs in Controller mode, it starts the Registry server, a gRPC server on a given endpoint (`-registryEndpoint`) that serves the RegistryServer interface. The driver(s) running in Node mode can register themselves with node-specific information such as node ID, NodeControllerServer endpoint, and their available persistent memory capacity.
The Master controller server is started by the PMEM-CSI driver running in Controller mode and serves the Controller interface defined by the CSI specification. The server responds to CreateVolume(), DeleteVolume(), ControllerPublishVolume(), ControllerUnpublishVolume(), and ListVolumes() calls coming from the external-provisioner and external-attacher sidecars. It forwards the publish and unpublish volume requests to the appropriate Node controller server running on a worker node that was registered with the driver.
The Node controller server is started by the PMEM-CSI driver running in Node mode and implements the ControllerPublishVolume and ControllerUnpublishVolume methods of the Controller service interface defined by the CSI specification. It serves the ControllerPublishVolume() and ControllerUnpublishVolume() requests coming from the Master controller server and creates/deletes persistent memory devices.
The Node server is started by the driver running in Node mode and implements the Node service interface defined in the CSI specification. It serves the NodeStageVolume(), NodeUnstageVolume(), NodePublishVolume(), and NodeUnpublishVolume() requests coming from the Container Orchestrator (CO).
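The relationship between driver modes and gRPC servers can be summarized in a short Go sketch. The mode strings (`controller`, `node`, `unified`) and the server labels are assumptions chosen for illustration. The Unified entry in particular is inferred from the statement that the Controller and Node services run combined on one host, not from a documented server list.

```go
package main

import "fmt"

// serversFor returns the gRPC servers that a driver instance would start
// in the given mode. The mode spellings are assumed, not confirmed.
func serversFor(mode string) ([]string, error) {
	switch mode {
	case "controller":
		// Single instance per cluster; forwards requests to worker nodes.
		return []string{"IdentityServer", "RegistryServer", "MasterControllerServer"}, nil
	case "node":
		// Runs on each worker node that has persistent memory.
		return []string{"IdentityServer", "NodeControllerServer", "NodeServer"}, nil
	case "unified":
		// Assumption: Controller and Node services combined on one host,
		// with no registry because there is nothing to register with.
		return []string{"IdentityServer", "ControllerServer", "NodeServer"}, nil
	default:
		return nil, fmt.Errorf("unknown driver mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"controller", "node", "unified"} {
		servers, err := serversFor(mode)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("-mode=%s starts %v\n", mode, servers)
	}
}
```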
The following diagram illustrates the communication channels between driver components:
All pmem-csi specific communication between the Master Controller (RegistryServer, MasterControllerServer) and the NodeControllers (NodeControllerServer) described above is protected by mutual TLS. Both client and server must identify themselves, and the certificates they present must be trusted. The common name in each certificate is used to identify the different components. The following common names have a special meaning:
- `pmem-registry` is used by the RegistryServer.
- `pmem-node-controller` is used by NodeControllerServers.
The `test/setup-ca-kubernetes.sh` script shows how to generate certificates signed by the Kubernetes cluster root Certificate Authority, and the provided deployment files show how to use the generated certificates to set up the driver. The test cluster is set up using certificates created by that script. The `test/setup-ca.sh` script also shows how to generate self-signed certificates. These are just examples; administrators of a cluster must ensure that they choose key lengths and algorithms of sufficient strength for their purposes and manage certificate distribution.
A production deployment can improve upon that by using some other key delivery mechanism, such as Vault.
The following diagram illustrates how the PMEM-CSI driver performs dynamic volume provisioning in Kubernetes:
Building has been verified using these components:
- Go: version 1.10.1 (Go 1.11 is required for `make test`)
- ndctl: versions 62 to 64, either built on the development host via autogen, configure, make, and install as per the instructions in README.md, or installed as ndctl package(s) from the distribution repository.
Building of Docker images has an additional requirement:
- Docker-ce: verified using version 18.06.1
Persistent memory device(s) are required for operation. However, some development and testing can be done using QEMU-emulated persistent memory devices, see README-qemu-notes.
The driver does not create persistent memory Regions, but expects Regions to exist when the driver starts. The utility ipmctl can be used to create Regions.
The driver deployment in Kubernetes cluster has been verified on:
| Branch | Kubernetes branch/version |
|---|---|
| devel | Kubernetes 1.11 branch v1.11.3 |
| devel | Kubernetes 1.12 |
| devel | Kubernetes 1.13 |
Early development and verification was performed on QEMU-emulated persistent memory devices.
The build was verified on the system described below:
- Host: Dell PowerEdge R620, distro: openSUSE Leap 15.0, kernel 4.12.14, qemu 2.11.2
- Guest VM: 32GB RAM, 8 vCPUs, Ubuntu 18.04.1 server, kernels 4.15.0, 4.18.0, 4.19.1
- See README-qemu-notes for more details about VM config
Use the command: git clone https://github.com/intel/pmem-csi
- Use `make`

  This produces the following binaries in the `_output` directory:

  - `pmem-ns-init`: Helper utility for namespace initialization.
  - `pmem-vgm`: Helper utility for creating logical volume groups over PMEM devices created.
  - `pmem-csi-driver`: PMEM-CSI driver.

- Use `make build-images` to produce Docker container images.
- Use `make push-images` to push Docker container images to a Docker image registry. The default is to push to a local Docker registry.
See the Makefile for additional make targets and possible make variables.
This is useful in development/trial mode.
Use `util/run-lvm-unified` as user:root.
This runs the two preparation stages and starts the driver binary, which listens for and responds to API requests on a TCP socket. You can modify this to use a Unix socket, if needed.
Use `util/run-direct-unified` as user:root to start the driver in DeviceMode:Direct. This script skips the two preparation stages and starts the driver binary with the corresponding DeviceMode option.
This section assumes that a Kubernetes cluster is already available with at least one node that has persistent memory device(s). For development or testing, it is also possible to use a cluster that runs on QEMU virtual machines, see the "End-to-end testing" section below.
- Label the cluster nodes that have persistent memory support
$ kubectl label node pmem-csi-4 storage=pmem
The label storage: pmem needs to be added to the cluster node that provides persistent memory device(s).
Clusters with multiple nodes with persistent memory are not fully supported at the moment. Support for this will be added when making the CSI driver topology-aware.
- **Deploy the driver to Kubernetes using DeviceMode:LVM**
$ sed -e 's/192.168.8.1:5000/<your registry>/' deploy/kubernetes-1.12/pmem-csi-lvm.yaml | kubectl create -f -
- **Deploy the driver to Kubernetes using DeviceMode:Direct**
$ sed -e 's/192.168.8.1:5000/<your registry>/' deploy/kubernetes-1.12/pmem-csi-direct.yaml | kubectl create -f -
The deployment YAML file uses the registry address for the QEMU test cluster setup (see below). When deploying on a real cluster, a registry that the cluster can access has to be used.
The `deploy` directory contains one directory or symlink for each tested Kubernetes release. The most recent one might also work on future, currently untested releases.
- Define a storage class using the driver
$ kubectl create -f deploy/kubernetes-<kubernetes version>/pmem-storageclass.yaml
- Provision a pmem-csi volume
$ kubectl create -f deploy/kubernetes-<kubernetes version>/pmem-pvc.yaml
- Start an application requesting provisioned volume
$ kubectl create -f deploy/kubernetes-<kubernetes version>/pmem-app.yaml
The application uses storage: pmem in its nodeSelector list to ensure that it runs on the right node.
- Once the application pod is in 'Running' status, check that it has a pmem volume
$ kubectl get po my-csi-app
NAME READY STATUS RESTARTS AGE
my-csi-app 1/1 Running 0 1m
$ kubectl exec my-csi-app -- df /data
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ndbus0region0/7a4cc7b2-ddd2-11e8-8275-0a580af40161
8191416 36852 7718752 0% /data
Use the `make test` command.
Note: Testing code is not completed yet. Currently it runs some passes using `gofmt` and `go vet`.
The driver can be verified in the single-host context. This running mode is called "Unified" in the driver. Both the Controller and Node services run combined on the local host, without a Kubernetes context.
The endpoint for driver access can be specified either:
- with each csc command as `--endpoint tcp://127.0.0.1:10000`
- by exporting the endpoint as an environment variable; see `util/lifecycle-unified.sh`
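The endpoint string combines a scheme with an address. The following Go sketch shows one way a client or server could split such a string into the network and address arguments expected by `net.Dial` or `net.Listen`. The function name `parseEndpoint`, the `unix://` form, and the `CSI_ENDPOINT` variable name are assumptions for illustration; check `util/lifecycle-unified.sh` for the variable that the scripts actually export.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// parseEndpoint splits an endpoint such as "tcp://127.0.0.1:10000" or
// "unix:///tmp/csi.sock" into a network name and an address.
func parseEndpoint(ep string) (network, addr string, err error) {
	u, err := url.Parse(ep)
	if err != nil {
		return "", "", fmt.Errorf("invalid endpoint %q: %v", ep, err)
	}
	switch u.Scheme {
	case "tcp":
		if u.Host == "" {
			return "", "", fmt.Errorf("endpoint %q has no host:port", ep)
		}
		return "tcp", u.Host, nil
	case "unix":
		if u.Path == "" {
			return "", "", fmt.Errorf("endpoint %q has no socket path", ep)
		}
		return "unix", u.Path, nil
	default:
		return "", "", fmt.Errorf("endpoint %q: unsupported scheme %q", ep, u.Scheme)
	}
}

func main() {
	// Assumed variable name; fall back to the address used in the text.
	ep := os.Getenv("CSI_ENDPOINT")
	if ep == "" {
		ep = "tcp://127.0.0.1:10000"
	}
	network, addr, err := parseEndpoint(ep)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("network=%s address=%s\n", network, addr)
}
```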
These run-time dependencies are used by the plugin in Unified mode:
- lvm2
- shred
- mount
- file
- blkid
- `lifecycle-unified`: example steps verifying a volume lifecycle
- `sanity-unified`: API test using csi-sanity
- `get-capabilities-unified`: query Controller and Node capabilities
These utilities are required by scripts residing in the `util/` directory:
- csc v0.5.0
- csi-sanity v0.2.0-1-95-g3bc4135
E2E testing relies on a cluster running inside multiple QEMU virtual
machines. This is known to work on a Linux development host system.
The `qemu-system-x86_64` binary must be installed, either from upstream QEMU or the Linux distribution.
For networking, the `ip` tool from the `iproute2` package must be installed. The following command must be run once after booting the host machine and before starting the virtual machine:
test/runqemu-ifup 4
This configures four tap devices for use by the current user. At the moment, the test setup uses:
- `pmemtap0/1/2/3`
- `pmembr0`
- 192.168.8.1 for the build host side of the bridge interfaces
- 192.168.8.2/4/6/8 for the virtual machines
- the same DNS server for the virtual machines as on the development host
It is possible to configure this by creating one or more files ending in `.sh` (for shell) in the directory `test/test-config.d` and setting shell variables in those files. For all supported options, see test-config.sh.
To undo the configuration changes made by `test/runqemu-ifup` when the tap devices are no longer needed, run:
test/runqemu-ifdown
KVM must be enabled and the user must be allowed to use it. Usually this is done by adding the user to the `kvm` group. The "Install QEMU-KVM" section in the Clear Linux documentation contains further information about enabling KVM and installing QEMU.
To ensure that QEMU and KVM are working, run this:
make _work/clear-kvm-original.img _work/start-clear-kvm _work/OVMF.fd
cp _work/clear-kvm-original.img _work/clear-kvm-test.img
_work/start-clear-kvm _work/clear-kvm-test.img
The result should be a login prompt like this:
[ 0.049839] kvm: no hardware support
clr-c3f99095d2934d76a8e26d2f6d51cb91 login:
The message about missing KVM hardware support comes from inside the virtual machine and indicates that nested KVM is not enabled. This message can be ignored because nested KVM is not needed.
Now the running QEMU can be killed and the test image removed again:
killall qemu-system-x86_64 # in a separate shell
rm _work/clear-kvm-test.img
reset # Clear Linux changes terminal colors, undo that.
The `clear-kvm` images are prepared automatically by the Makefile. By default, four different images are prepared. Each image is pre-configured with its own hostname and with network settings for the corresponding tap device. `clear-kvm.img` is a symlink to `clear-kvm.0.img`, where the Kubernetes master node will run.
The images will contain the latest Clear Linux OS and have the Kubernetes version supported by Clear Linux installed.
`make start` will bring up a Kubernetes test cluster inside four QEMU virtual machines. It can be called multiple times in a row and will attempt to bring up missing pieces each time it is invoked.
Once it completes, everything is ready for interactive use via `kubectl` inside the virtual machine. Alternatively, you can also set `KUBECONFIG` as shown at the end of the `make start` output and use a local `kubectl` binary.
The first node is the Kubernetes master without persistent memory. The other three nodes are worker nodes with one 32GB NVDIMM each. The worker nodes have already been labeled with `storage=pmem`, but the pmem-csi driver still needs to be installed manually as shown in ["Run as Kubernetes deployment"](#run-as-kubernetes-deployment). If the Docker registry runs on the local development host, then the `sed` command which replaces the Docker registry is not needed.
Once done, `make stop` will clean up the cluster and shut everything down.
`make test_e2e` will run csi-test sanity tests and some Kubernetes storage tests against the pmem-csi driver.
The driver will get deployed automatically and thus must not be installed yet on the cluster.
When ginkgo is installed, it can be used to run individual tests and to control additional aspects of the test run. For example, to run just the E2E provisioning test (create PVC, write data in one pod, read it in another) in verbose mode:
$ KUBECONFIG=$(pwd)/_work/clear-kvm-kube.config REPO_ROOT=$(pwd) ginkgo -v -focus=pmem-csi.*should.provision.storage.with.defaults ./test/e2e/
Nov 26 11:21:28.805: INFO: The --provider flag is not set. Treating as a conformance test. Some tests may not be run.
Running Suite: PMEM E2E suite
=============================
Random Seed: 1543227683 - Will randomize all specs
Will run 1 of 61 specs
Nov 26 11:21:28.812: INFO: checking config
Nov 26 11:21:28.812: INFO: >>> kubeConfig: /nvme/gopath/src/github.com/intel/pmem-csi/_work/clear-kvm-kube.config
Nov 26 11:21:28.817: INFO: Waiting up to 30m0s for all (but 0) nodes to be schedulable
...
Ran 1 of 61 Specs in 58.465 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 60 Skipped
PASS
Ginkgo ran 1 suite in 1m3.850672246s
Test Suite Passed
It is also possible to run just the sanity tests until one of them fails:
$ REPO_ROOT=`pwd` ginkgo '-focus=sanity' -failFast ./test/e2e/
...
Report a bug by filing a new issue.
Contribute by opening a pull request.
Learn about pull requests.