red-hat-storage/ocs-operator

Adding capacity to an existing Storage Cluster fails silently on OKD 4.6 😭

Closed this issue · 2 comments

Hi, we are trying to add new nodes to an OpenShift Container Storage (OCS) operator already installed on our OKD 4.6 cluster.

We managed to install OCS on 6 nodes 6 months ago through the Catalog web UI of our OKD cluster (which was 4.5 at the time)

We are now following the same procedure as before (the docs for OpenShift 4.5): https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.5/html/managing_openshift_container_storage/scaling-storage-nodes_rhocs

We also checked other documentation produced by Red Hat (all of it is incomplete and misses crucial details, so we need to gather information from multiple places).

Our current Server Version is 4.6.0-0.okd-2021-01-23-132511
The OCS operator version is 4.8.0

Installing the "local storage operator" (version 4.6.0) works without any issue and we can see that a localvolumeset-local-provisioner is up and running for the new node we want to add

But when we try to Add Capacity to the already installed OpenShift Container Storage, we hit an issue, or worse: we don't get any error and the operation just fails (because the Operator system Red Hat has been trying to push for a few years does not actually work, and it brings more problems than solutions to the "hybrid" cloud ecosystem, since it only focuses on making deployments work with AWS, GCP and Azure).

  1. We go to Administrator > Operator > Installed Operators > OpenShift Container Storage > Storage Cluster tab, then click on the storage cluster already created (which works), named ocs-storagecluster.

    For this Storage Cluster, we then click on Actions and Add Capacity. 2 storage classes are available: ceph-hdd and ocs-storagecluster-ceph-rgw

  2. We use ceph-hdd; when we select it, the UI tells us: Available capacity: 149.7 TiB / 3 replicas

  3. Then we click Add. OKD seems to have succeeded, since it closes the Add Capacity window, but the new node is not added to the storage cluster! And no pods are created on this node to handle the Persistent Volumes (e.g. rook-ceph-osd pods).
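
One way to confirm the symptom in step 3 from the command line is to count the OSD pods per node. The following is a minimal sketch, not an official procedure: it assumes OCS runs in the default `openshift-storage` namespace and that the OSD pods carry the Rook label `app=rook-ceph-osd`; the function names are made up for illustration.

```python
import json
import subprocess
from collections import Counter

# Assumptions: default OCS namespace and the Rook OSD pod label.
NAMESPACE = "openshift-storage"
OSD_SELECTOR = "app=rook-ceph-osd"


def fetch_osd_pods() -> dict:
    """Run `oc get pods` for the OSD pods and return the parsed JSON."""
    result = subprocess.run(
        ["oc", "get", "pods", "-n", NAMESPACE, "-l", OSD_SELECTOR, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def osd_pods_per_node(pod_list: dict) -> Counter:
    """Count OSD pods by the node they are scheduled on."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        # Pods that are still Pending have no spec.nodeName yet.
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        counts[node] += 1
    return counts
```

Calling `osd_pods_per_node(fetch_osd_pods())` on a logged-in workstation returns a counter keyed by node name; a node that received no OSD (such as c-0016 here) simply does not appear in it.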

The official Red Hat documentation for 4.5 and 4.6 only says "Click Add and wait for the cluster state to change to Ready." without any additional details or guidance in case the operation fails (maybe it is expected to work flawlessly?)

The Events are not showing any error related to this operation (the node we are trying to add is c-0016, which you can see in the screenshot).

Screenshot from 2021-05-19 16-12-36

We noticed that our OKD cluster version (4.6.0) is not the same as the OCS operator version (4.8.0; it was installed as 4.6.0 but got upgraded). There is only 1 installation channel for the OCS operator though, and it points to 4.8.0. Is it possible to downgrade the operator version without losing existing claims?
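
The mismatch described above can be checked mechanically by comparing the major.minor part of both version strings. This is only an illustrative sketch with made-up helper names; it assumes the two products are expected to share a minor version, which is a working assumption here rather than a documented rule.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '4.6.0-0.okd-2021-01-23-132511'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_minor(cluster_version: str, operator_version: str) -> bool:
    """True when both versions share the same major.minor prefix."""
    return major_minor(cluster_version) == major_minor(operator_version)


# Versions reported in this issue.
print(same_minor("4.6.0-0.okd-2021-01-23-132511", "4.8.0"))
```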

Do you have any idea how we could debug this issue? Maybe you already know how it can be fixed?

Or are there any guidelines for debugging the Operator? It is not clear how the operator can be customized and adapted to a specific cluster (bare metal).

Then we click Add. OKD seems to have succeeded, since it closes the Add Capacity window, but the new node is not added to the storage cluster! And no pods are created on this node to handle the Persistent Volumes (e.g. rook-ceph-osd pods).

The issue is that OSDs get created, but not on the new nodes? Have you labeled the new nodes?
Also please paste the storagecluster CR's yaml.
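
For reference, both checks can be scripted. The sketch below assumes the storage label is `cluster.ocs.openshift.io/openshift-storage` and that the cluster lives in the `openshift-storage` namespace (the OCS defaults, but verify them on your cluster); the helper names are hypothetical.

```python
from typing import List

# Assumptions: default OCS node label, namespace, and the cluster name from this issue.
STORAGE_LABEL = "cluster.ocs.openshift.io/openshift-storage"
NAMESPACE = "openshift-storage"
STORAGE_CLUSTER = "ocs-storagecluster"


def nodes_missing_label(node_list: dict) -> List[str]:
    """Given the parsed output of `oc get nodes -o json`, list unlabeled nodes."""
    missing = []
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        if STORAGE_LABEL not in labels:
            missing.append(node["metadata"]["name"])
    return missing


def label_command(node_name: str) -> List[str]:
    """argv that adds the storage label (with an empty value) to a node."""
    return ["oc", "label", "node", node_name, f"{STORAGE_LABEL}="]


def storagecluster_yaml_command() -> List[str]:
    """argv that dumps the StorageCluster CR as YAML."""
    return ["oc", "get", "storagecluster", STORAGE_CLUSTER,
            "-n", NAMESPACE, "-o", "yaml"]
```

Each returned list can be passed to `subprocess.run(..., check=True)`, or joined with spaces and pasted into a shell.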

Also, the UI for OKD 4.6 might not necessarily be compatible with OCS 4.8 which is unreleased. We don't recommend using our dev channels in production. 4.8 is not yet stable.

If you want to build and deploy OCS 4.6, you can follow the build instructions in the readme and build the release-4.6 branch.

Thanks a lot for this feedback @rohantmp

We had a call with @jarrpa about this issue

As you said, it is due to the fact that we installed OCS about a year ago, when it was not really released (so we installed it from what was the dev channel at the time).

And we now have 2 catalog sources for the OCS operator (the current "official" one from Red Hat, and the dev one we installed a year ago).
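
To see which catalog source and channel an installed operator actually follows, the OLM Subscription objects can be inspected. A minimal sketch, assuming the subscription lives in the `openshift-storage` namespace and that its JSON was obtained with `oc get subscriptions.operators.coreos.com -n openshift-storage -o json`; the function name and the sample values in the usage note are illustrative only.

```python
from typing import Dict, List, Optional


def summarize_subscriptions(sub_list: dict) -> List[Dict[str, Optional[str]]]:
    """Extract package, channel, catalog source and installed CSV per Subscription."""
    rows = []
    for sub in sub_list.get("items", []):
        spec = sub.get("spec", {})
        status = sub.get("status", {})
        rows.append({
            "subscription": sub.get("metadata", {}).get("name"),
            "package": spec.get("name"),
            "channel": spec.get("channel"),
            "source": spec.get("source"),              # CatalogSource name
            "sourceNamespace": spec.get("sourceNamespace"),
            "installedCSV": status.get("installedCSV"),
        })
    return rows
```

If the `source` field names the old dev catalog rather than the one shipped with OperatorHub, that matches the situation described here. The catalog sources themselves are typically listed with `oc get catalogsource -n openshift-marketplace`.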

Our options will be to:

  • Upgrade our cluster to 4.7.0 or 4.8.0
  • Uninstall the OCS operator (dev channel) and reinstall it using the "official" operator in OperatorHub

I am closing this issue for now; we will open a new one if we run into other issues down the road.

Thanks for the help!