Rasdaemon: wrong DIMM label mapping
garadar opened this issue · 0 comments
Hi all,
I have an issue with the DIMM label mapping.
First, here are my DIMMs without labels:
(rubis)-[root@rubis247 ~]$ ras-mc-ctl --error-count
Label CE UE
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 0 0
CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 0 0
CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 0 0
CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 0 0
CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 0 0
CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 5539 0
CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 0 0
CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 0 0
According to the report without labels, CPU1 channel 0 slot 0 (SrcID#1, Chan#0, DIMM#0) has 5539 correctable errors.
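To make that reading explicit, here is a minimal Python sketch (not part of rasdaemon) that pulls the rows with nonzero counts out of an `--error-count` report and splits a default label into its fields. The label pattern is assumed from the output above, and the function names are hypothetical.

```python
import re

# Pattern assumed from the default labels printed above,
# not taken from the rasdaemon or kernel sources.
_DEFAULT_LABEL = re.compile(r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)")


def parse_default_label(label):
    """Split a default EDAC label into its numeric fields."""
    m = _DEFAULT_LABEL.fullmatch(label)
    if m is None:
        raise ValueError(f"unrecognised label: {label!r}")
    src_id, ha, channel, dimm = map(int, m.groups())
    return {"src_id": src_id, "ha": ha, "channel": channel, "dimm": dimm}


def dimms_with_errors(report):
    """Return (label, ce, ue) for each report row with a nonzero count."""
    rows = []
    for line in report.splitlines():
        parts = line.split()
        # Skip the "Label CE UE" header and anything that is not a data row.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        label, ce, ue = parts[0], int(parts[1]), int(parts[2])
        if ce or ue:
            rows.append((label, ce, ue))
    return rows
```

Fed the report above, `dimms_with_errors` keeps only the `CPU_SrcID#1_Ha#0_Chan#0_DIMM#0` row, and `parse_default_label` decodes it as SrcID 1, channel 0, DIMM 0.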
Then I labeled my DIMMs according to the Intel documentation for the S2600KPR mainboard:
(rubis)-[root@rubis247 ~]$ ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: Intel Corporation model S2600KPR
(rubis)-[root@rubis247 ~]$ cat /etc/ras/dimm_labels.d/intel
vendor: Intel Corporation
model: S2600KPR
# <label>: <mc>.channel>.<slot>
#CPU1
DIMM_A1: 0.0.0
DIMM_B1: 0.1.0
DIMM_C1: 0.2.0
DIMM_D1: 0.3.0
#CPU2
DIMM_E1: 1.0.0
DIMM_F1: 1.1.0
DIMM_G1: 1.2.0
DIMM_H1: 1.3.0
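For clarity, the file above amounts to a lookup from (mc, channel, slot) to a silk-screen label. A minimal Python sketch of that lookup follows. It only handles the one-entry-per-line form used in this file and is not how ras-mc-ctl itself parses it; `parse_labels` is a hypothetical helper.

```python
def parse_labels(text):
    """Map (mc, channel, slot) to label for simple 'LABEL: mc.channel.slot' lines."""
    labels = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() in ("vendor", "model"):
            continue  # header lines identify the board, not a DIMM
        mc, channel, slot = (int(n) for n in value.split("."))
        labels[(mc, channel, slot)] = key
    return labels
```

With this file, `parse_labels(text)[(1, 0, 0)]` gives `DIMM_E1`, which is the label I expect the errors to be reported against.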
Then I registered my labels and printed them:
(rubis)-[root@rubis247 ~]$ ras-mc-ctl --print-labels
LOCATION CONFIGURED LABEL SYSFS CONTENTS
mc0 channel 0 slot 0 DIMM_A1 DIMM_A1
mc0 channel 1 slot 0 DIMM_B1 DIMM_B1
mc0 channel 2 slot 0 DIMM_C1 DIMM_C1
mc0 channel 3 slot 0 DIMM_D1 DIMM_D1
mc1 channel 0 slot 0 DIMM_E1 DIMM_E1
mc1 channel 1 slot 0 DIMM_F1 DIMM_F1
mc1 channel 2 slot 0 DIMM_G1 DIMM_G1
mc1 channel 3 slot 0 DIMM_H1 DIMM_H1
mc1 channel 0 slot 0 corresponds to DIMM_E1, which looks like the correct mapping according to the documentation. So the 5539 errors should be reported against DIMM_E1, but instead I get:
(rubis)-[root@rubis247 ~]$ ras-mc-ctl --error-count
Label CE UE
DIMM_E1 0 0
DIMM_D1 0 0
DIMM_H1 0 0
DIMM_F1 0 0
DIMM_G1 0 0
DIMM_A1 5539 0
DIMM_B1 0 0
DIMM_C1 0 0
I also checked the IPMI SEL, and it confirms that the correctable errors are on DIMM_E1 and not DIMM_A1.
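A further cross-check would be to read the EDAC sysfs entries directly, bypassing ras-mc-ctl, to see which `mcN` directory actually holds the count. Below is a minimal sketch, assuming the kernel exposes per-DIMM `dimm_label`, `dimm_location`, `dimm_ce_count` and `dimm_ue_count` files under `/sys/devices/system/edac/mc/mc*/dimm*/`. The count files may be missing on older kernels, in which case the sketch reports `None`. `read_edac_dimms` is a hypothetical helper.

```python
from pathlib import Path


def _read(path):
    """Return the stripped file content, or None if the file does not exist."""
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None


def _to_int(value):
    """Convert a sysfs counter string to int, passing None through."""
    return int(value) if value is not None else None


def read_edac_dimms(root="/sys/devices/system/edac/mc"):
    """Return one dict per DIMM directory found under the EDAC sysfs root."""
    dimms = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        dimms.append({
            "mc": dimm.parent.name,                       # e.g. "mc1"
            "location": _read(dimm / "dimm_location"),    # e.g. "channel 0 slot 0"
            "label": _read(dimm / "dimm_label"),
            "ce": _to_int(_read(dimm / "dimm_ce_count")),
            "ue": _to_int(_read(dimm / "dimm_ue_count")),
        })
    return dimms
```

Printing the `mc`, `location`, `label` and `ce` fields of each entry would show whether the nonzero count sits under the controller that was labeled DIMM_E1 or the one labeled DIMM_A1.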
Maybe I am doing something wrong, or maybe this is a bug. Can someone confirm? :)