/check_hpasm

A plugin (monitoring-plugin, not nagios-plugin, see also http://is.gd/PP1330) which checks the hardware health of HP Proliant Servers. (May also be used for other devices which implement the CPQHLTH mib)

Primary LanguagePerlGNU General Public License v2.0GPL-2.0

check_hpasm Nagios Plugin README
---------------------

This plugin checks the hardware health of HP Proliant servers with the 
hpasm software installed. It uses the hpasmcli command to acquire the 
condition of the system's critical components like cpus, power supplies,
temperatures, fans and memory modules. Newer versions also use SNMP.

* For instructions on installing this plugin for use with Nagios,
  see below. In addition, generic instructions for the GNU toolchain
  can be found in the INSTALL file.

* For major changes between releases, read the CHANGES file.

* For information on detailed changes that have been made,
  read the Changelog file.

* This plugins is self documenting.  All plugins that comply with
  the basic guidelines for development will provide detailed help when
  invoked with the '-h' or '--help' options.

You can check for the latest plugin at:
  http://www.consol.de/opensource/nagios/check-hpasm

Send mail to gerhard.lausser@consol.de for assistance.  
Please include the OS type and version that you are using.
Also, run the plugin with the '-v' option and provide the resulting 
version information.  Of course, there may be additional diagnostic information
required as well.  Use good judgment.


How to "compile" the check_hpasm script.
--------------------------------------------------------

1) Run the configure script to initialize variables and create a Makefile, etc.

	./configure --prefix=BASEDIRECTORY --with-nagios-user=SOMEUSER --with-nagios-group=SOMEGROUP --with-perl=PATH_TO_PERL --with-noinst-level=LEVEL --with-degrees=UNIT --with-perfdata --with-hpacucli

   a) Replace BASEDIRECTORY with the path of the directory under which Nagios
      is installed (default is '/usr/local/nagios')
   b) Replace SOMEUSER with the name of a user on your system that will be
      assigned permissions to the installed plugins (default is 'nagios')
   c) Replace SOMEGRP with the name of a group on your system that will be
      assigned permissions to the installed plugins (default is 'nagios')
   d) Replace PATH_TO_PERL with the path where a perl binary can be found.
      Besides the system wide perl you might have installed a private perl
      just for the nagios plugins (default is the perl in your path).
   e) Replace LEVEL with one of ok, warning, critical or unknown.
      If the required hpasm-rpm is not installed, the check_hpasm plugin
      will exit with the level specified. If you chose ok, the message
      will say "ok - .... hpasm is not installed". This is different from
      the "ok - hardware working fine" if hpasm was found.
      The default is to treat a missing hpasm package as ok.
   f) Replace UNIT with one of celsius or fahrenheit. The hpasmcli "show temp"
      prints temperatures both in units of celsius and fahrenheit. With the
      --with-degrees option you can decide which units will be shown in an
      alarm message.
      The default is "celsius".
   g) You can tell check_hpasm to output performance data by default if
      you call configure with the --enable-perfdata option.
   h) You can tell check_hpasm to check the raid status with the hpacucli command
      if you call configure with the --enable-hpacucli option.
      You need the hpacucli rpm.

2) "Compile" the plugin with the following command:

	make

    This will produce a "check_hpasm" script. You will also find
    a "check_hpasm.pl" which you better ignore. It is the base for
    the compilation filled with placeholders. These will be replaced during
    the make process.


3) Install the compiled plugin script with the following command:

	make install

   The installation procedure will attempt to place the plugin in a 
   'libexec/' subdirectory in the base directory you specified with
   the --prefix argument to the configure script.


4) Verify that your configuration files for Nagios contains
   the correct paths to the new plugin.


5) Add this line to /etc/sudoers:
   nagios      ALL=NOPASSWD: /sbin/hpasmcli
   or ths, if you also installed the hpacu package
   nagios      ALL=NOPASSWD: /sbin/hpasmcli, /usr/sbin/hpacucli
  


Command line parameters
-----------------------

-v, --verbose
   Increased verbosity will print how check_hpasm communicates with the
   hpasm daemon and which values were acquired.

-t, --timeout
   The number of seconds after which the plugin will abort.

-b, --blacklist
   If some components of your system are missing (mostly the secondary
   power supply bay is empty) and you tolerate this, then blacklist the
   missing/failed component to avoid false alarms.
   The value for this option is a slash-separated list of components to
   ignore.
   Example: -b p:1,2/f:2/t:3,4/c:1/d:0-1,0-2
   means: ignore power supplies #1 and #2, fan #2, temperature #3 and #4,
   cpu #1 and dimms #1 and #2 in cartridge #0.

-c, --customthresh
   Override the machine-default temperature thresholds.
   Example: -c 1:60/4:80/5:50
   Sets limit for temperature 1 to 60 degrees, temperature 4 to 80 degrees
   and temperature 5 to 50 degrees. You get the consecutive numbers by
   calling check_hpasm -v
   ...
      checking temperatures
       1 processor_zone temperature is 46 (62 max)
       2 cpu#1 temperature is 43 (73 max)
       3 i/o_zone temperature is 54 (68 max)
       4 cpu#2 temperature is 46 (73 max)
       5 power_supply_bay temperature is 38 (55 max)

-p, --perfdata
   Add performance data to the output even if you did not compile check_hpasm
   with --with-perfdata in step 1.



SNMP and Memory Modules
-----------------------
Older hardware does not always show valuable information when queried for
the health of memory modules. Maybe it's because older modules do not support
error checking at all.


1. no cpqHeResMemModule
---------------------------------------------------------------------------

2. collapsed cpqHeResMemModule
---------------------------------------------------------------------------

Some (older) systems do not support the cpqHeResMemModuleEntry table.
Either there is no oid with 1.3.6.1.4.1.232.6.2.14.11.1 at all
or there is a single oid like

Example:
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 524288
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0

                                ^-- module number
                              ^-- cartridge number (0 = system board)
                            ^-- size

iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
 
I compared 300 systems and found out that with
1.3.6.1.4.1.232.6.2.14.11.1.<no1>.<no2>.<no3> = <no4>
no1 is always 1
no2 is always 0
no3 is the number of memory slots (including the empty ones).
no4 is always 0. It is probably the health status of the 
overall memory subsystem. I don't know.
I will implement 0 = ok, not 0 = ask compaq

cpqSiMemECCStatus provides no usable information. All my test systems
showed 0 which is an undocumented value.

function get_size(cpqHeResMemModuleEntry) will return 1.

3. cpqHeResMemModule containing crap
---------------------------------------------------------------------------

grepping for cpqSiMemBoardSize shows 4 modules
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0

grepping for cpqHeResMemEntry shows one module with zero values
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.2.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.3.0.0 = ""
iso.3.6.1.4.1.232.6.2.14.11.1.4.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.5.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.6.0.0 = Hex-STRING: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 


4. cpqHeResMemModuleEntry and cpqSiMemModuleEntry use different table indexes
---------------------------------------------------------------------------

cpqSiMemBoardIndex      1.3.6.1.4.1.232.2.2.4.5.1.1 
cpqSiMemModuleIndex     1.3.6.1.4.1.232.2.2.4.5.1.2 

cpqHeResMemBoardIndex   1.3.6.1.4.1.232.6.2.14.11.1.1 
cpqHeResMemModuleIndex  1.3.6.1.4.1.232.6.2.14.11.1.2 


cpqSiMemBoardIndex
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.1 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.2 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.3 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.4 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.5 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.6 = INTEGER: 0

cpqHeResMemBoardIndex
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.1 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.2 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.3 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.4 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.5 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.6 = INTEGER: 0

It is not possible to use the SNMP-table-indices to identify the 
corresponding he-entry. Matching is done with nested loops.

5. even worse: cpqHeResMemBoardIndex and cpqSiMemBoardIndex don't match
---------------------------------------------------------------------------

cpqSiMemBoardIndex
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.1 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.2 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.3 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.4 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.5 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.6 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.7 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.8 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.1 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.2 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.3 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.4 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.5 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.6 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.7 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.8 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.3.1 = INTEGER: 3

cpqHeResMemBoardIndex
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.1 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.2 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.4 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.5 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.7 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.8 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.1 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.2 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.3 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.4 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.5 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.6 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.7 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.8 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.2.1 = INTEGER: 2


Redundant fans
-----------------------
I saw one old server which had only half of the possible fans installed.

Fan#                               1    2      3    4      5    6

cpqHeFltTolFanPresent              yes  no     yes  no     yes  no
cpqHeFltTolFanRedundant            no   no     no   no     no   no
cpqHeFltTolFanRedundantPartner     2    1      4    3      6    5
cpqHeFltTolFanCondition            ok   other  ok   other  ok   other
cpqHeFltTolFanLocation             cpu  cpu    cpu  cpu    io   io

Normally this would result in
...
fan #1 (cpu) is not redundant
fan #2 (cpu) is not redundant
fan #3 (cpu) is not redundant
fan #4 (cpu) is not redundant
fan #5 (ioboard) is not redundant
fan #6 (ioboard) is not redundant
WARNING - fan #1 (cpu) is not redundant, fan #2 (cpu) is not redundant, fan #3 (cpu) is not redundant, fan #4 (cpu) is not redundant, fan #5 (ioboard) is not redundant, fan #6 (ioboard) is not redundant

However it was the server's owner decision not to install fan pairs but only one fan per location, so for him this is a false alert.

By using --ignore-fan-redundancy check_hpasm only looks at the cpqHeFltTolFanCondition and ignores dependencies between two fans, so the result is:

fan 1 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 2
fan 3 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 4
fan 5 speed is normal, pctmax is 50%, location is ioboard, redundance is no, partner is 6
OK - System: 'proliant ml370 g3', ...


A snmp forwarding trick 
-----------------------
local - where check_hpasm runs
remote - where a proliant can be reached
proliant - where the snmp agent runs

remote:
ssh -R6667:localhost:6667 local
socat tcp4-listen:6667,reuseaddr,fork UDP:proliant:161

local:
socat udp4-listen:161,reuseaddr,fork tcp:localhost:6667
check_hpasm --hostname 127.0.0.1


Sample data from real machines
------------------------------

hpasmcli=$(which hpasmcli)
hpacucli=$(which hpacucli)
for i in server powersupply fans temp dimm
do
  $hpasmcli -s "show $i" | while read line
  do
    printf "%s %s\n" $i "$line"
  done
done
if [ -x "$hpacucli" ]; then
  for i in config status
  do
    $hpacucli ctrl all show $i | while read line
    do
      printf "%s %s\n" $i "$line"
    done
  done
fi

If you think check_hpasm is not working correctly, please run the above script
and send me the output. It's also helpful to see the output of snmpwalk
snmpwalk .... 1.3.6.1.4.1.232


--
Gerhard Lausser <gerhard.lausser@consol.de>