Intermittent access issues with NFS Volumes
erkerb4 opened this issue · 0 comments
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
I deployed rook-nfs using the quick-start guide, then followed the "create and initialize NFS server" section to set up two NFS servers. One NFS server is backed by HDD storage and the other by SSD storage. The operator built the NFS servers successfully.
Next, I created a deployment and a PVC that used the SC for the NFS server. When the pod first started, the PV was created fine and bound correctly in the pod. Everything worked as expected for a little while (a week, maybe?). Then, all of a sudden, the pods were unable to access the volumes anymore. Opening a shell and running 'ls' on the NFS volume would just hang.
When I restarted the pod that had the NFS volume, the pod failed to start. It never passes the "init" stage and eventually errors out because it is unable to mount the volume backed by the NFS server.
I have tried restarting all the nodes and scheduling the pod on another node, but the issue persists.
The only way I was able to get the pod to mount the volume again was to change the volume spec in the deployment from persistentVolumeClaim to nfs:
volumes:
  - name: gold-nfs-mount
    nfs:
      path: /gold-scratch/dir    # <-- export
      server: 172.30.17.118      # <-- service IP address of the NFS server
The weird thing is, this has happened once before, and the problem eventually went away on its own.
Expected behavior:
Be able to continue using persistentVolumeClaim for the volume instead of mounting it via nfs directly.
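For comparison, a minimal sketch of the PVC-based volume spec in the deployment that stops working (the claim name gold-nfs-claim is illustrative, not the exact name from my cluster):

```yaml
# Sketch of the PVC-backed volume spec that hangs.
# claimName here is illustrative.
volumes:
  - name: gold-nfs-mount
    persistentVolumeClaim:
      claimName: gold-nfs-claim
```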
How to reproduce it (minimal and precise):
Create the rook-nfs operator using the quick-start guide, then follow the "create and initialize NFS server" section to set up the NFS servers.
To make it easier, these are my manifests:
Persistent Volume:
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gold-scratch
  labels:
    type: ssd
spec:
  storageClassName: local-storage
  capacity:
    storage: 200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/mnt/scratch/gold"
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1
                - node2
PVC + NFS Server:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gold-scratch
  namespace: rook-nfs
spec:
  storageClassName: "local-storage"
  accessModes:
    - ReadWriteMany
  selector:
    matchLabels:
      type: ssd
  resources:
    requests:
      storage: 200Gi
---
apiVersion: nfs.rook.io/v1alpha1
kind: NFSServer
metadata:
  name: gold-nfs
  namespace: rook-nfs
spec:
  replicas: 1
  exports:
    - name: gold-scratch
      server:
        accessMode: ReadWrite
        squash: "none"
      persistentVolumeClaim:
        claimName: gold-scratch
  annotations:
    rook-nfs: gold-scratch
    rook: nfs
StorageClass:
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gold-local
  labels:
    rook-nfs: gold-scratch
    type: ssd
parameters:
  exportName: gold-scratch
  nfsServerName: gold-nfs
  nfsServerNamespace: rook-nfs
provisioner: nfs.rook.io/gold-nfs-provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate
Verify:
$ kubectl get pods -n rook-nfs --selector=app=gold-nfs
NAME READY STATUS RESTARTS AGE
gold-nfs-0 2/2 Running 16 (4d20h ago) 5d13h
$ kubectl get sc gold-local
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gold-local nfs.rook.io/gold-nfs-provisioner Delete Immediate false 33d
Deploy an app that uses the gold-local SC for its PVC. Then wait?
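A minimal sketch of such a consumer app, assuming any pod that touches the mount will do (the names and the nginx image are illustrative):

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gold-app-claim          # illustrative name
spec:
  storageClassName: gold-local  # the Rook NFS provisioner's SC
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gold-app                # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gold-app
  template:
    metadata:
      labels:
        app: gold-app
    spec:
      containers:
        - name: app
          image: nginx          # any image that reads/writes the mount
          volumeMounts:
            - name: gold-nfs-mount
              mountPath: /data
      volumes:
        - name: gold-nfs-mount
          persistentVolumeClaim:
            claimName: gold-app-claim
```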
File(s) to submit:
The NFS server logs do not show any errors.
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 20.04.3 LTS
- Kernel (e.g. uname -a): Linux 5.11.0-43-generic
- Cloud provider or hardware configuration: N/A, on-prem
- Rook version (use rook version inside of a Rook Pod): Rook NFS 1.7.3
- Storage backend version (e.g. for ceph do ceph -v): Rook NFS 1.7.3
- Kubernetes version (use kubectl version): v1.23.1
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm