librenms/librenms-agent

better nvme support

evanrich opened this issue · 6 comments

The latest commit to the SMART script says it adds NVMe support, and from the look of it it should pull power cycles, but that doesn't seem to work on Optane drives, even though smartctl uses the same wording for them:

smartctl -A /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          0%
Percentage Used:                    0%
Data Units Read:                    12,602,051 [6.45 TB]
Data Units Written:                 459,682,008 [235 TB]
Host Read Commands:                 125,287,911
Host Write Commands:                6,911,909,418
Controller Busy Time:               2,268
Power Cycles:                       23
Power On Hours:                     15,577
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

When you run the script, though, it doesn't output "Power Cycles" as 23:

nvme0,null,null,null,null,null,null,null,null,null,41,null,null,null,null,null,null,0,0,0,0,0,0,0,0,15577

only the temperature (41) and power on hours (15577) get recorded

is it possible to also add available spare/percentage used as well?
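For illustration, here is a minimal Python sketch of pulling these fields out of the `smartctl -A` text shown above. It is not the agent's actual code; the function name `parse_nvme_health` and the regular expression are assumptions made for this example, and the sample text is a subset of the output pasted in this issue.

```python
import re

# Subset of the `smartctl -A /dev/nvme0` output from this issue.
SAMPLE = """\
Temperature:                        41 Celsius
Available Spare:                    100%
Percentage Used:                    0%
Power Cycles:                       23
Power On Hours:                     15,577
"""


def parse_nvme_health(text):
    """Return {label: int} for lines shaped like 'Label:   1,234 [unit]'.

    Hypothetical helper: takes the first number after the colon and
    strips thousands separators, so '15,577' becomes 15577.
    """
    values = {}
    for line in text.splitlines():
        match = re.match(r"^([^:]+):\s+(\d[\d,]*)", line)
        if match:
            label = match.group(1).strip()
            values[label] = int(match.group(2).replace(",", ""))
    return values


health = parse_nvme_health(SAMPLE)
print(health["Power Cycles"], health["Available Spare"], health["Percentage Used"])
```

Keying on the label text rather than on line position keeps the sketch independent of which optional lines a given drive reports.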

here's the full smartctl output if it helps:

smartctl -a /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPED1D280GA
Serial Number:                      PHMB7392002X280CGN
Firmware Version:                   E2010435
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          280,065,171,456 [280 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Jul 23 23:54:24 2020 PDT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0006):     Wr_Unc DS_Mngmt
Maximum Data Transfer Size:         32 Pages

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    18.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          0%
Percentage Used:                    0%
Data Units Read:                    12,602,051 [6.45 TB]
Data Units Written:                 459,685,948 [235 TB]
Host Read Commands:                 125,287,911
Host Write Commands:                6,911,939,367
Controller Busy Time:               2,268
Power Cycles:                       23
Power On Hours:                     15,577
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Feel free to write a patch, including one that adds the spare percentage to LibreNMS, though I don't know exactly what it measures.

arrmo commented

Hi,

I wouldn't mind taking a run at this, but to confirm a couple things first 😄,

  • is the output format captured anywhere, to know what parameters are being output?
  • thinking that for NVMe, TBW is a key item ... but it should be compared against some threshold (i.e. warranty limit, to get percentage of life used). Agreed? How / where to store that info?
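As a rough sketch of the TBW idea, the snippet below converts "Data Units Written" to terabytes and compares it against a rated endurance figure. NVMe reports data units in thousands of 512-byte blocks, which matches the `[235 TB]` smartctl prints for the drive above. The `rated_tbw` parameter and the value 1000.0 used in the example are placeholders, not this drive's actual warranty limit; where such a limit would be stored is still the open question.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000


def terabytes_written(data_units_written):
    """Convert the NVMe 'Data Units Written' counter to decimal terabytes."""
    return data_units_written * NVME_DATA_UNIT_BYTES / 1e12


def percent_of_rated_tbw(data_units_written, rated_tbw):
    """Share of a rated TBW figure consumed so far, in percent.

    rated_tbw is a hypothetical input (terabytes); it has to come from
    the vendor's datasheet or some local configuration.
    """
    return 100.0 * terabytes_written(data_units_written) / rated_tbw


# 459,682,008 data units, as in the smartctl -A output from this issue.
print(round(terabytes_written(459_682_008), 1))
# Placeholder rating of 1000 TB, purely for illustration.
print(round(percent_of_rated_tbw(459_682_008, 1000.0), 1))
```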

Thanks!

For details on what each position in the output means, take a look at the LibreNMS poller:

https://github.com/librenms/librenms/blob/master/includes/polling/applications/smart.inc.php

For new values you have to write graphs and extend the polling.

hope this helps

arrmo commented

It does help - thanks! But a thought 😆. Please do let me know what you think as well, before I launch into this - avoid going down a wrong path.

Thinking that we really don't want to create new (different) metrics, as it makes comparing SSDs (SATA vs. NVMe) rather difficult / painful. And really, the metrics should be similar, agreed?

Looking at the list in the link you provided, as well as some info on Wikipedia, I'm thinking to "map" some NVMe outputs (sample log below) to SMART parameters, to align with the current data.

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    4%
Data Units Read:                    9,389,938 [4.80 TB]
Data Units Written:                 43,616,415 [22.3 TB]
Host Read Commands:                 200,946,334
Host Write Commands:                397,460,540
Controller Busy Time:               655
Power Cycles:                       188
Power On Hours:                     11,566
Unsafe Shutdowns:                   151
Media and Data Integrity Errors:    0
Error Information Log Entries:      228
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

So then, "mapping",

Temperature => 194 Temperature Celsius
Percentage Used => 231 Life Left (actually, 100% - Percentage Used) ... or calculated based on a TBW limit (but then we need that limit).
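A minimal sketch of what such a mapping could look like, assuming the NVMe values have already been parsed into a dict keyed by their smartctl labels. The table `NVME_TO_SMART` and the function `map_nvme_to_smart` are hypothetical names for this example, not part of the agent. Life left is taken as 100 minus Percentage Used and clamped at 0, since NVMe allows Percentage Used to exceed 100.

```python
# Hypothetical mapping: NVMe health label -> (SMART attribute ID, converter).
NVME_TO_SMART = {
    "Temperature": (194, lambda v: v),                    # Temperature Celsius
    "Percentage Used": (231, lambda v: max(0, 100 - v)),  # Life Left
}


def map_nvme_to_smart(health):
    """Translate parsed NVMe health values into SMART-style attribute IDs."""
    mapped = {}
    for label, (attr_id, convert) in NVME_TO_SMART.items():
        if label in health:
            mapped[attr_id] = convert(health[label])
    return mapped


# Values from the sample log above.
print(map_nvme_to_smart({"Temperature": 28, "Percentage Used": 4}))
```

Keeping the mapping in one table means additional NVMe fields can be aligned with existing SMART columns without touching the loop.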

Thoughts on this?

Thanks again.

That's the way it should be.
Reuse existing values where possible and add new ones only for new things.
Don't forget to create graph templates for the SMART agent in LibreNMS so the new values get graphed.