daviswr/ZenPacks.daviswr.ZFS

REMOVED disk not correctly detected by zpool component

Closed this issue · 29 comments

We use /dev/disk/by-id/ paths referencing the ata- or scsi-/sas- symlinks pointing to our devices in our zpool configurations. Just noticed a pool lost a drive, the OS removed it altogether at fail time, and the ZenPack is having some issues with this.
The disk is still showing as online, but a warning event is generated with the following message:

Component:  raidz1-0
Event Class:    /Cmd/Fail
Status:     New
Message:    Traceback (most recent call last):
  File "/opt/zenoss/Products/ZenRRD/zencommand.py", line 819, in _processDatasourceResults
    parser.processResults(datasource, results)
  File "/opt/zenoss/packs/ZenPacks.daviswr.ZFS/ZenPacks/daviswr/ZFS/parsers/zpool/status.py", line 68, in processResults
    health = pool_match.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'

I'm assuming a problem in the zpool status output parser.

As a result, the disk itself is not marked as offline in Zenoss, but the VDEV does show yellow (warning state) because of the parsing failure that produces the 'NoneType' reference.
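For illustration, a guard along these lines in the parser would at least avoid the traceback (just a sketch; pool_health_re and the loop are my guesses at status.py's structure, not the actual code):

    # Sketch only: pool_health_re and this loop are guesses at the
    # parser's structure, not the actual contents of status.py.
    import re

    pool_health_re = re.compile(r'^\s*state:\s+(\S+)')

    def parse_pool_health(output):
        """Return the pool state string, or None if it isn't found."""
        for line in output.splitlines():
            match = pool_health_re.match(line)
            if match:
                return match.groups()[0]
        # REMOVED disks / unexpected output: don't call .groups() on None
        return None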

One additional note: the devices comprising the vdevs (/dev/disk/by-id/ata-XXX-YYY) are used as whole disks. However, the plugin detects /dev/disk/by-id/ata-XXX-YYY-part1 as the member, whereas zpool status properly shows ata-XXX-YYY. This could explain why it's not finding the disk: if it searches for the -part1 name, the OS has removed the base device that path derives from.

EDIT: we didn't catch this earlier because our other zpools use dm-crypt devices as their backing stores; zdb -L on those systems shows /dev/disk/by-id/dm-name-..., whereas on the raw disks zdb -L actually shows the -part1 suffix. Still not sure why this is causing the ZenPack to report the offline disk as ready, though...

I think I've fixed the problem in the zpool status parser, and the zpool modeler should drop the partition/slice number from the device name if it's a whole disk. I was initially resigned to not doing vdev templates, at least not yet, due to the naming difference between zdb and zpool output, but I guess my half-baked health checks changed that... I need to redo the zpool status parser, though. Can probably do vdev I/O graphs soon, too, as a result.
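Roughly, the suffix-dropping would look something like this (a sketch, not what's in the modeler; the patterns are assumptions):

    # Sketch of dropping the partition/slice suffix from a whole-disk
    # member name; the patterns below are assumptions, not modeler code.
    import re

    def strip_partition(dev, whole_disk=True):
        """ata-XXX-YYY-part1 -> ata-XXX-YYY, c0t0d0s0 -> c0t0d0"""
        if not whole_disk:
            return dev
        dev = re.sub(r'-part\d+$', '', dev)                 # Linux by-id partition
        dev = re.sub(r'^(c\d+t\w+d\d+)s\d+$', r'\1', dev)   # Solaris-style slice
        return dev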

Is your failed disk marked as REMOVED, UNAVAIL, or not present at all in zpool status?

Thanks, the failed disk is marked as REMOVED.

Also interesting: a pool that is in the SUSPENDED state due to too many failures shows up as still being up (though the health threshold error works fine).

If you're talking about the Status line on the component's Details, that seems to be controlled by the presence of an event of class /Status for that component. A /Status event will change it from Up to Down, but I'm not sure how to exert any finer-grained control over it. Poking around in zendmd, components have getStatus() and getStatusString() methods but not a corresponding setStatus().

I still need to create some event transforms for this ZenPack, especially to make the health status meaningful. Aside from different severity levels, I could re-class the events that need to mark a component as truly "down" (or not).
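As a rough sketch of what such a transform could do (the 'current' attribute and the state list are assumptions; a real Zenoss transform acts on evt supplied by the framework rather than a function argument):

    # Rough transform sketch; attribute names and the state list are
    # assumptions, and a real transform would act on 'evt' directly.
    DOWN_STATES = ('FAULTED', 'UNAVAIL', 'REMOVED', 'SUSPENDED')

    def reclass_zfs_event(evt):
        """Map a pool/VDev state carried on the event to class/severity."""
        state = str(getattr(evt, 'current', '') or '').upper()
        if state in DOWN_STATES:
            evt.eventClass = '/Status'   # marks the component as Down
            evt.severity = 5             # Critical
        elif state == 'ONLINE':
            evt.severity = 0             # Clear
        return evt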

@daviswr: any chance you figured out the transforms and vdev status bit in a private branch somewhere? It seems pool status and vdev status always come back as 0/ONLINE despite pools and their vdevs showing as degraded. I tried editing the zpool status parser; the way it's written, it looks like it'll bail on a pool's VDEV members at the first match on status if the input is a full zpool status -v.

Unfortunately 5c6f23e did not fix the issue:
[screenshot]

Indeed not. Hm. The expected value's getting into the RRD, but the threshold isn't working as expected. I'm investigating.

Thanks as always sir.

I haven't looked into the sources in a while, but if they parse the error counters in the status output, that might be another thing to look for. If not, maybe a /\s+0\s+0\s+0$/ sort of approach to checking VDEV status as a backup?
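Something along these lines is what I mean, walking the NAME/STATE/READ/WRITE/CKSUM table from zpool status -v (just a sketch of the backup check, not the ZenPack's parser):

    # Sketch of a backup check against the config table of
    # 'zpool status -v' output; not the ZenPack's parser.
    import re

    ROW_RE = re.compile(
        r'^\s+(?P<name>\S+)\s+(?P<state>\S+)'
        r'\s+(?P<read>\S+)\s+(?P<write>\S+)\s+(?P<cksum>\S+)'
        )

    def unhealthy_members(status_output):
        """Return config-table rows that aren't ONLINE with zeroed counters."""
        problems = []
        in_table = False
        for line in status_output.splitlines():
            if line.strip().startswith('NAME'):
                in_table = True       # header of the config table
                continue
            if in_table and not line.strip():
                in_table = False      # blank line ends the table
                continue
            match = ROW_RE.match(line) if in_table else None
            if not match:
                continue
            row = match.groupdict()
            counters = (row['read'], row['write'], row['cksum'])
            if row['state'] != 'ONLINE' or counters != ('0', '0', '0'):
                problems.append(row)
        return problems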

I think I've got a solution but am doing some more testing (something the last commit sorely lacked if I'm honest...) before a commit.
Hadn't previously considered the error counters, but that's good data to collect.

Much appreciated, will keep an eye on the tab when you need another tester

Could I get your opinion on the severity mappings for states?

    # https://docs.oracle.com/cd/E19253-01/819-5461/gamno/index.html
    # https://docs.oracle.com/cd/E19253-01/819-5461/gcvcw/index.html
    severities = {
        # The device or virtual device is in normal working order
        'ONLINE': SEVERITY_CLEAR,
        # Available hot spare
        'AVAIL': SEVERITY_CLEAR,
        # Hot spare that is currently in use
        'INUSE': SEVERITY_INFO,
        # The virtual device has experienced a failure but can still function
        'DEGRADED': SEVERITY_WARNING,
        # The device or virtual device is completely inaccessible
        'FAULTED': SEVERITY_CRITICAL,
        # The device has been explicitly taken offline by the administrator
        'OFFLINE': SEVERITY_WARNING,
        # The device or virtual device cannot be opened
        'UNAVAIL': SEVERITY_CRITICAL,
        # The device was physically removed while the system was running
        'REMOVED': SEVERITY_ERROR,
        'SUSPENDED': SEVERITY_ERROR,
        }

I think I'd map any severity where the disk is inoperable as critical; Zenoss has a tendency to rate a lot of things as error, which creates alert fatigue for the class.
OFFLINE & DEGRADED I'd make into errors.
REMOVED I'd promote to critical, ditto SUSPENDED.
I think AVAIL should probably be info so we know when we have spares hanging out.
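Roughly, applying that to the mapping above (same SEVERITY_* names as your snippet; untested on my end):

    severities = {
        'ONLINE': SEVERITY_CLEAR,
        # Info so we know when we have spares hanging out
        'AVAIL': SEVERITY_INFO,
        'INUSE': SEVERITY_INFO,
        # Still working but needs attention
        'DEGRADED': SEVERITY_ERROR,
        'OFFLINE': SEVERITY_ERROR,
        # Device or pool is inoperable in all of these
        'FAULTED': SEVERITY_CRITICAL,
        'UNAVAIL': SEVERITY_CRITICAL,
        'REMOVED': SEVERITY_CRITICAL,
        'SUSPENDED': SEVERITY_CRITICAL,
        }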

Give d24d30a a try

Thank you - success.

There may be an edge case or two. After a manual model run, it shows the state of some VDEVs on some systems as TBD:
[screenshot]

The vast majority are working correctly now - detecting root VDEV and leaf VDEV states, alerting, the works.

EDIT: it seems the ZFS command parser is failing on that one host. It's incorrectly reading a single snapshot as all of the datasets (snapshots are set to be ignored for all of these), so root and disk VDEVs are in TBD status and the dataset view is confused.

TBD's a temporary state for VDevs until the real value is polled. zdb doesn't list their health state, and due to how the zpool modeler's written, it'd be, shall we say, cumbersome to get vdev health at model time.
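For what it's worth, the model-time side boils down to something like this (illustration only; apart from ObjectMap itself, the names here are guesses rather than the modeler's code):

    # Illustration only; apart from ObjectMap, the names are guesses.
    from Products.DataCollector.plugins.DataMaps import ObjectMap

    def vdev_object_map(vdev_id, zdb_attrs):
        """zdb doesn't report health, so model it with a placeholder."""
        data = dict(zdb_attrs)
        data.update({
            'id': vdev_id,
            'health': 'TBD',   # the real value arrives with the next poll
            })
        return ObjectMap(data, modname='ZenPacks.daviswr.ZFS.RootVDev')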

I haven't touched any of the dataset-related stuff. Was that host working before?

It was working before, but after modeling they all seem to be setting themselves to TBD. The dataset issue was a bookmark showing up as a ZFS dataset and seemingly unrelated to the TBD thing. What's weird is that right after installing the updated plugin the VDEVs were showing state, but now I'm seeing TBD on a bunch. Any chance they reset from a modeling pass?

Yeah, right now all vdevs will reset to TBD temporarily after modeling; Zenoss seemed to want something in that field for the property to display. The default cycle time for command polling is 60 seconds, so they should pick up the real value soon.

In typing this I might've thought of a couple of things to play with, but basically I'm going to have to spend a lot of "quality time" with the zpool modeler to fix it properly. Time/effort tradeoff and all.

Unfortunately they're not coming back from the TBD state, even after nearly 30 minutes.

Do non-ONLINE ones show up? If it's only devices that should be ONLINE that are stuck at TBD, I think I know what it is. The health threshold currently doesn't trip for ONLINE (only when clearing a non-ONLINE state), so the call that updates the model never gets reached unless the component was in a different (non-TBD) state beforehand. I should've caught that.

I think that theory makes sense. I've re-enabled monitoring for a dead VDEV (we tick them off unless they're data-critical, once a case is created to deal with it) to verify.

Yeah, DEGRADED VDEVs show up, good ones are TBD. Waiting for a critical one to cycle through for the leaf VDEV.

Confirmed: all "not doing well" states show up, all good states are TBD.

I took a look over the last few commits and have a question: if component health is already set to whatever the last polling interval provided, could the modeling process skip updating it when the new value would be TBD?
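To illustrate what I mean (hypothetical; I don't know how the modeler is actually structured), it could leave the attribute out entirely when all it has is TBD, so the last polled value is preserved:

    # Hypothetical sketch of the suggestion, not the ZenPack's code.
    def vdev_attributes(vdev_id, polled_health=None):
        """Build the attribute dict for a VDev's ObjectMap."""
        data = {'id': vdev_id}
        if polled_health and polled_health != 'TBD':
            data['health'] = polled_health
        # otherwise omit 'health'; the existing component value stays put
        return data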

Tried a few different things and came up with what I probably should have done in the first place for displaying current health. Whew. affe923: the display looks at the actual current datapoint rather than relying on the transform to update the model. The transform still updates the model, though; the display just no longer uses that particular field.
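In spirit, the display change amounts to something like this (the datapoint name, numeric mapping, and helper are illustrative guesses, not what's in affe923; cacheRRDValue is inherited from Zenoss's RRDView on real components):

    # Illustrative only: datapoint name and numeric mapping are guesses.
    HEALTH_STATES = {
        0: 'ONLINE',
        1: 'DEGRADED',
        2: 'FAULTED',
        }

    def health_string(component, datapoint='health'):
        """Render the most recently polled health value for the UI."""
        # cacheRRDValue comes from RRDView on Zenoss components
        value = component.cacheRRDValue(datapoint, default=None)
        if value is None:
            return 'TBD'
        try:
            return HEALTH_STATES.get(int(value), 'UNKNOWN')
        except (TypeError, ValueError):
            return 'TBD'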

Added bonuses are error graphs and scrub/resilver events.

I don't know if there's a way to get a current attribute or datapoint value during modeling.

They're showing online again, graphs work, offline/failed devices work. So far so good.
Want to close this out, and I'll open new ones as I find any latent issues?

Works for me.