openculinary/infrastructure

Restore redundancy for persistence path storage (/mnt/persistence)

Closed this issue · 2 comments

Describe the bug
The Linux RAID array for our persistent storage path (/mnt/persistence) has been degraded for some time - likely months (!).

This means we're running without a safety margin, for the service database in particular. Having backups helps a little, but we should still make restoring redundancy a priority.

It seems that a missing/offline disk is the root cause of the problem.

To Reproduce
Steps to reproduce the behavior:

  1. Log in to the persistent storage host.
  2. At the command-line, run sudo mdadm --detail /dev/md0.
  3. Observe that the array is incomplete and in a degraded state.
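
The check in the steps above could be scripted roughly as follows. This is only a sketch: `md_state` and `is_degraded` are hypothetical helper names, and it assumes the array-level status appears on a `State : ...` line of the `mdadm --detail` output.

```shell
#!/bin/sh
# Sketch: read `mdadm --detail` output on stdin and print the array's State field.
# Assumes the array-level line looks like "State : clean, degraded".
md_state() {
    # Match only the "State :" line, not the per-device table header.
    awk -F' : ' '/^ *State :/ { print $2; exit }'
}

# Exit status 0 when the State field on stdin mentions "degraded".
is_degraded() {
    md_state | grep -q 'degraded'
}
```

For example, `sudo mdadm --detail /dev/md0 | is_degraded && echo "md0 is degraded"` would print a warning only while the array is missing a member.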

Expected behavior
The persistent storage RAID array should be fully populated, providing data storage redundancy.

The server contains two disk bays, each holding a 1TB drive. The two drives should form a RAID-1 mirror: both contain the same data and together appear as one md device.

One of the disks was not appearing under /dev/sd* on the machine, and this wasn't obvious externally: the LED on the relevant drive tray was lit, although it did not indicate any activity.

I've relocated some of the disks and identified the physical location of the missing 1TB drive. After hot-swapping it into an alternative tray, it is now visible to the host. I've re-added it to the md0 array and it was accepted immediately, which I think indicates that it still carried the appropriate array labels (meaning: it was indeed the expected and correct disk to re-add).
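
For reference, a re-add along these lines could be sketched as below. The exact commands used here were not recorded, so this is an assumption: `/dev/sdX` is a placeholder device name, `run` and `readd_member` are hypothetical helpers, and `--re-add` is one plausible choice (a plain `--add` would be the fallback if md refused the re-add).

```shell
#!/bin/sh
# Sketch: re-add a returned member disk to an md array, with a dry-run mode.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: readd_member ARRAY DEVICE   (e.g. /dev/md0 and a placeholder /dev/sdX)
readd_member() {
    array=$1
    device=$2
    # Print the md superblock on the disk so the array UUID can be checked by eye.
    run sudo mdadm --examine "$device" || return 1
    # Ask md to take the device back into the array.
    run sudo mdadm "$array" --re-add "$device"
}
```

Running it first with `DRY_RUN=1` set shows the two commands without touching the array, which seems a sensible precaution on a host that is already running without redundancy.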

The state of the array is now clean, degraded, recovering according to mdadm --detail /dev/md0, and the rebuild status is ~20% complete.
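
A small helper like the one below could be used to follow the rebuild. It is a sketch that assumes the progress is reported on a `Rebuild Status : N% complete` line of `mdadm --detail`; `rebuild_percent` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: read `mdadm --detail` output on stdin and print the rebuild percentage.
# Prints nothing when no "Rebuild Status" line is present (no rebuild running).
rebuild_percent() {
    sed -n 's/^ *Rebuild Status : *\([0-9][0-9]*\)% complete.*/\1/p'
}
```

For example, `sudo mdadm --detail /dev/md0 | rebuild_percent` prints just the number while a rebuild is in progress, and `cat /proc/mdstat` gives a similar view without needing mdadm.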

Recovery of the persistent data array is complete: it has status clean, and redundancy (disk failure tolerance) has been restored.