Error: storageclass.storage.k8s.io "fast-disks" not found
sunkararp opened this issue · 12 comments
What happened:
- K8S version 1.20.7
- Deployed a new AKS cluster using the Standard_L8s_v2 VM SKU. This SKU has 2 TB of NVMe storage.
- Followed the steps in Azure/kubernetes-volume-drivers/local.
- Nothing shows up when I run this command:

  ```
  kubectl get pv
  ```
- Below is the error I noticed:

  ```
  E0825 22:32:23.367737 1 discovery.go:220] Failed to discover local volumes: failed to get ReclaimPolicy from storage class "fast-disks": storageclass.storage.k8s.io "fast-disks" not found
  ```

This error disappeared after I ran the command below:

```
kubectl apply -f local-pv-storageclass.yaml -n kube-system
```
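For reference, the StorageClass this creates is roughly the following (a minimal sketch; see the Azure/kubernetes-volume-drivers repo for the exact manifest):

```yaml
# Sketch of what local-pv-storageclass.yaml defines; check the repo for the exact file.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-disks
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer  # local PVs require WaitForFirstConsumer
reclaimPolicy: Delete
```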
But now I have a new error:

```
Normal NotTriggerScaleUp 4m17s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity, 1 node(s) didn't find available persistent volumes to bind
```
In my StatefulSet I have this affinity:
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - XXXXX
      topologyKey: kubernetes.io/hostname
```
@sunkararp could you remove the affinity setting?
I'm getting this error now:

```
Normal NotTriggerScaleUp 56s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity, 1 node(s) didn't find available persistent volumes to bind
```

FYI, we have 3 scale units.
Here is my complete StatefulSet YAML:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-rp-search-data
  labels:
    heritage: "Helm"
    release: "es-data"
    chart: "elasticsearch"
    app: "test-rp-search-data"
  annotations:
    esMajorVersion: "7"
spec:
  serviceName: test-rp-search-data-headless
  selector:
    matchLabels:
      app: "test-rp-search-data"
  replicas: 5
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: pvc-localdisk
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi
      storageClassName: fast-disks
  template:
    metadata:
      name: "test-rp-search-data"
      labels:
        release: "es-data"
        chart: "elasticsearch"
        app: "test-rp-search-data"
      annotations:
    spec:
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      serviceAccountName: "test-rp-search-data"
      nodeSelector:
        agentpool: data
      terminationGracePeriodSeconds: 120
      volumes:
        # Currently some extra blocks accept strings
        # to continue with backwards compatibility this is being kept
        # whilst also allowing for yaml to be specified too.
        - name: localdisk
          persistentVolumeClaim:
            claimName: pvc-localdisk
      enableServiceLinks: true
      initContainers:
      - name: configure-sysctl
        securityContext:
          runAsUser: 0
          privileged: true
        image: "docker.elastic.co/elasticsearch/elasticsearch:7.12.0"
        imagePullPolicy: "IfNotPresent"
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        resources:
          {}
      containers:
      - name: "elasticsearch"
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        image: "docker.elastic.co/elasticsearch/elasticsearch:7.12.0"
        imagePullPolicy: "IfNotPresent"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: "wait_for_status=green&timeout=1s" )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file
              # Disable nss cache to avoid filling dentry cache when calling curl
              # This is required with Elasticsearch Docker using nss < 3.52
              export NSS_SDB_USE_CACHE=no
              http () {
                local path="${1}"
                local args="${2}"
                set -- -XGET -s
                if [ "$args" != "" ]; then
                  set -- "$@" $args
                fi
                if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                  set -- "$@" -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                fi
                curl --output /dev/null -k "$@" "http://127.0.0.1:9200${path}"
              }
              if [ -f "${START_FILE}" ]; then
                echo 'Elasticsearch is already running, lets check the node is healthy'
                HTTP_CODE=$(http "/" "-w %{http_code}")
                RC=$?
                if [[ ${RC} -ne 0 ]]; then
                  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
                  exit ${RC}
                fi
                # ready if HTTP code 200, 503 is tolerable if ES version is 6.x
                if [[ ${HTTP_CODE} == "200" ]]; then
                  exit 0
                elif [[ ${HTTP_CODE} == "503" && "7" == "6" ]]; then
                  exit 0
                else
                  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
                  exit 1
                fi
              else
                echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                if http "/_cluster/health?wait_for_status=green&timeout=1s" "--fail" ; then
                  touch ${START_FILE}
                  exit 0
                else
                  echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                  exit 1
                fi
              fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        ports:
        - name: http
          containerPort: 9200
        - name: transport
          containerPort: 9300
        resources:
          limits:
            cpu: 3000m
            memory: 32Gi
          requests:
            cpu: 3000m
            memory: 32Gi
        env:
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: discovery.seed_hosts
          value: "test-rp-search-master-headless"
        - name: cluster.name
          value: "test-rp-search"
        - name: network.host
          value: "0.0.0.0"
        - name: ES_JAVA_OPTS
          value: "-Xmx16g -Xms16g"
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "false"
        - name: node.master
          value: "false"
        volumeMounts:
        - name: localdisk
          mountPath: /mnt/localdisk/elasticsearch/data
          readOnly: false
```
Does the example here work?
Yes, your example works like a charm.
One thing to note, though: the NVMe disk was not freed when I deleted the example deployment.
If reclaimPolicy is set to Delete in the local volume StorageClass, the data is cleaned up after the PVC is deleted. The local volume PV goes into Released status and, after around 5 minutes by default, becomes available to be bound again; you can tune the minResyncPeriod value to make the PV status refresh more quickly.
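For example, minResyncPeriod can be set in the provisioner's ConfigMap (a sketch based on the sig-storage-local-static-provisioner configuration format; the ConfigMap name and the hostDir/mountDir paths are assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-provisioner-config  # name assumed; use your deployment's ConfigMap
  namespace: kube-system
data:
  # Resync more often than the default 5m so cleaned-up local PVs reappear sooner.
  minResyncPeriod: "30s"
  storageClassMap: |
    fast-disks:
      hostDir: /mnt/fast-disks   # assumed discovery directory on the node
      mountDir: /mnt/fast-disks
```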
Thanks, but the real issue is why we are not able to bind to the PVs.
We added `allowedTopologies` like below:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-disks
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer # Immediate is not supported
reclaimPolicy: Delete # available values: Delete, Retain
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - westus2-1
    - westus2-2
    - westus2-3
```
This is the latest error message:

```
Normal NotTriggerScaleUp 6s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity, 1 node(s) didn't find available persistent volumes to bind
```
Could you remove `allowedTopologies`? Every local disk PV can only be bound on the node it belongs to. Per the error message, 2 nodes are not in the above `allowedTopologies`, only one node is in a westus2-x zone, and that node only has one local disk PV.
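To illustrate (a sketch; the PV name, capacity, device path, and node name are assumptions): each PV created by the static provisioner carries a nodeAffinity that pins it to the node where the disk was discovered, so it can only satisfy a claim whose pod is scheduled onto that node.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 1788Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: fast-disks
  local:
    path: /mnt/fast-disks/nvme0n1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - aks-data-00000000-vmss000000  # the node that owns the NVMe disk
```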
```
$ k describe StorageClass fast-disks -n cohort-search
Name: fast-disks
IsDefaultClass: No
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"fast-disks"},"provisioner":"kubernetes.io/no-provisioner","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}
Provisioner: kubernetes.io/no-provisioner
Parameters: <none>
AllowVolumeExpansion: <unset>
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: WaitForFirstConsumer
Events: <none>
```
The error:

```
Normal NotTriggerScaleUp 4m8s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity, 1 node(s) didn't find available persistent volumes to bind
```
> 2 node(s) didn't match Pod's node affinity

Can you run

```
kubectl get no --show-labels
kubectl get pv | grep local-pv
```

and then show `kubectl get pv local-pv-xxx -o yaml`?
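As a follow-up check (a sketch; the `agentpool=data` selector comes from the StatefulSet's nodeSelector above), you can compare the zone labels on the data nodes against the values listed under `allowedTopologies`:

```
# List the data-pool nodes with both the legacy and current zone label keys as columns.
kubectl get nodes -l agentpool=data \
  -L failure-domain.beta.kubernetes.io/zone \
  -L topology.kubernetes.io/zone
```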