mchehab/rasdaemon

Rasdaemon wrong mapping label

garadar opened this issue · 0 comments

Hi all,

I have an issue with DIMM label mapping.

First, here are my DIMMs without labels:

(rubis)-[root@rubis247 ~] $ ras-mc-ctl --error-count
Label                         	CE	UE
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#3_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#3_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#1_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#2_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#0_DIMM#0	5539	0
CPU_SrcID#1_Ha#0_Chan#1_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#2_DIMM#0	0	0

According to this unlabeled report, CPU 1, channel 0, slot 0 has 5539 correctable errors.

I then labeled my DIMMs according to the Intel documentation for the S2600KPR mainboard:
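
To pick out the noisy DIMM from such a listing without reading it by eye, here is a minimal sketch that parses the `Label / CE / UE` table. It assumes the whitespace-separated layout shown above; `parse_error_counts` is a hypothetical helper name, not part of rasdaemon.

```python
def parse_error_counts(text):
    """Parse a 'Label CE UE' table into {label: (ce, ue)}.

    Assumes each data row ends with two integer columns, as in the
    listing above; the header and any other lines are skipped.
    """
    counts = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # split off the last two columns
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        label, ce, ue = parts
        counts[label.strip()] = (int(ce), int(ue))
    return counts


# A few rows copied from the report above.
sample = (
    "Label                         \tCE\tUE\n"
    "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0\t0\t0\n"
    "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0\t5539\t0\n"
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0\t0\t0\n"
)

counts = parse_error_counts(sample)
# Keep only the labels with at least one correctable error.
noisy = {label: ce for label, (ce, ue) in counts.items() if ce > 0}
print(noisy)
```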

https://www.intel.com/content/dam/support/us/en/documents/server-products/server-boards/S2600KP_HNS2600KP.pdf
Page 54

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: Intel Corporation model S2600KPR
(rubis)-[root@rubis247 ~]$ cat /etc/ras/dimm_labels.d/intel
vendor: Intel Corporation
  model: S2600KPR
#  <label>: <mc>.<channel>.<slot>
    #CPU1
    DIMM_A1: 0.0.0
    DIMM_B1: 0.1.0
    DIMM_C1: 0.2.0
    DIMM_D1: 0.3.0

    #CPU2
    DIMM_E1: 1.0.0
    DIMM_F1: 1.1.0
    DIMM_G1: 1.2.0
    DIMM_H1: 1.3.0
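
To make explicit what I expect this file to mean, here is a minimal sketch that reads the `<label>: <mc>.<channel>.<slot>` lines into a lookup table. It is a simplified reader written for this report, not the parser that `ras-mc-ctl` uses; it handles only one location per label and one entry per line, and `parse_dimm_labels` is a hypothetical name.

```python
def parse_dimm_labels(text):
    """Map (mc, channel, slot) -> label for simple '<label>: a.b.c' lines.

    Assumption: one label per line with a single location, as in the
    file above. 'vendor:' and 'model:' lines and '#' comments are ignored.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() in ("vendor", "model"):
            continue
        mc, channel, slot = (int(x) for x in value.split("."))
        mapping[(mc, channel, slot)] = key
    return mapping


config = """\
vendor: Intel Corporation
  model: S2600KPR
    #CPU1
    DIMM_A1: 0.0.0
    DIMM_B1: 0.1.0
    #CPU2
    DIMM_E1: 1.0.0
    DIMM_F1: 1.1.0
"""

labels = parse_dimm_labels(config)
# With this file, mc1 channel 0 slot 0 is expected to be DIMM_E1.
print(labels[(1, 0, 0)])
```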

Then I registered the labels and printed them:

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 channel 0 slot 0                DIMM_A1              DIMM_A1             
mc0 channel 1 slot 0                DIMM_B1              DIMM_B1             
mc0 channel 2 slot 0                DIMM_C1              DIMM_C1             
mc0 channel 3 slot 0                DIMM_D1              DIMM_D1             
mc1 channel 0 slot 0                DIMM_E1              DIMM_E1             
mc1 channel 1 slot 0                DIMM_F1              DIMM_F1             
mc1 channel 2 slot 0                DIMM_G1              DIMM_G1             
mc1 channel 3 slot 0                DIMM_H1              DIMM_H1

mc1 channel 0 slot 0 corresponds to DIMM_E1, which appears to be the correct mapping according to the documentation. So the 5539 errors should be attributed to DIMM_E1, but instead I get:

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --print-label
Label  	CE	UE
DIMM_E1	0	0
DIMM_D1	0	0
DIMM_H1	0	0
DIMM_F1	0	0
DIMM_G1	0	0
DIMM_A1	5539	0
DIMM_B1	0	0
DIMM_C1	0	0
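
To compare the two listings, here is a rough sketch that pairs the rows of the unlabeled report with the rows of the labeled one. It rests on an assumption I cannot verify: that both listings print the DIMMs in the same order. Under that assumption it checks which `mc` index from my label file ends up attached to each `SrcID`.

```python
import re

# Rows copied from the unlabeled report, in printed order.
default_rows = [
    "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0",
    "CPU_SrcID#1_Ha#0_Chan#3_DIMM#0",
    "CPU_SrcID#0_Ha#0_Chan#3_DIMM#0",
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0",
    "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0",
    "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0",
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0",
    "CPU_SrcID#1_Ha#0_Chan#2_DIMM#0",
]
# Rows copied from the labeled report, in printed order.
labeled_rows = ["DIMM_E1", "DIMM_D1", "DIMM_H1", "DIMM_F1",
                "DIMM_G1", "DIMM_A1", "DIMM_B1", "DIMM_C1"]
# Label file contents: label -> (mc, channel, slot).
configured = {
    "DIMM_A1": (0, 0, 0), "DIMM_B1": (0, 1, 0),
    "DIMM_C1": (0, 2, 0), "DIMM_D1": (0, 3, 0),
    "DIMM_E1": (1, 0, 0), "DIMM_F1": (1, 1, 0),
    "DIMM_G1": (1, 2, 0), "DIMM_H1": (1, 3, 0),
}

pattern = re.compile(r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)")

src_to_mc = {}          # SrcID -> set of mc indices seen for it
channels_match = True   # do channel numbers agree row by row?
for default, label in zip(default_rows, labeled_rows):
    src, _ha, chan, _dimm = (int(x) for x in pattern.fullmatch(default).groups())
    mc, channel, _slot = configured[label]
    src_to_mc.setdefault(src, set()).add(mc)
    channels_match = channels_match and (chan == channel)

print(src_to_mc, channels_match)
```

If the same-order assumption holds, the channel numbers line up row by row, but every SrcID#0 row carries an mc1 label and every SrcID#1 row carries an mc0 label, which would match the E1/A1 swap I am seeing.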

I also checked the IPMI SEL, and it confirms that the correctable errors are on DIMM_E1, not DIMM_A1.

Maybe I am doing something wrong, or maybe this is a bug. Can someone confirm? :)