mchehab/rasdaemon

MCE errors not showing up in ras-mc-ctl

Closed · 1 comment

On a Debian 12 system with a Sandy Bridge CPU, I can't get machine check errors to show up in ras-mc-ctl. The drivers are loaded, and the MCE errors do show up in dmesg.

user@ed-siad-7:~$ dmesg | tail
[  119.189596] 8021q: adding VLAN 0 to HW filter on device enp7s0
[  137.700706] sw0: port 30(dp0ce1) entered blocking state
[  137.700711] sw0: port 30(dp0ce1) entered forwarding state
[  138.453998] sw0: port 29(dp0ce0) entered blocking state
[  138.454006] sw0: port 29(dp0ce0) entered forwarding state
[  597.360562] systemd-journald[1971]: File /var/log/journal/7df98e127aa4468d8decd8f71ef99d1c/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[ 1366.830245] mce: [Hardware Error]: Machine check events logged
[ 1366.830251] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: 0000000000000000
[ 1366.838823] mce: [Hardware Error]: TSC 2ac997e3063c3 
[ 1366.844468] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 947185249 SOCKET 0 APIC 0 microcode 7000019
user@ed-siad-7:~$ 
user@ed-siad-7:~$ sudo ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
user@ed-siad-7:~$ 
user@ed-siad-7:~$ sudo ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.
user@ed-siad-7:~$ 
user@ed-siad-7:~$ sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

user@ed-siad-7:~$

journal output:

user@ed-siad-7:~$ 
vyatta@ed-siad-7:~$ sudo journalctl -u rasdaemon.service -b
Jan 26 21:48:42 ed-siad-7 systemd[1]: Starting rasdaemon.service - RAS daemon to log the RAS events...
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: ras:mc_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: rasdaemon: ras:mc_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: rasdaemon: ras:aer_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: rasdaemon: mce:mce_record event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: rasdaemon: ras:extlog_mem_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: ras:mc_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Enabled event ras:mc_event
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: ras:aer_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Enabled event ras:aer_event
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: ras:mc_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: Enabled event ras:mc_event
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: ras:aer_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: mce:mce_record event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2587]: ras:extlog_mem_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: ras:aer_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: Enabled event ras:aer_event
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: mce:mce_record event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: mce:mce_record event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Enabled event mce:mce_record
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: Enabled event mce:mce_record
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: ras:extlog_mem_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Enabled event ras:extlog_mem_event
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Listening to events for cpus 0 to 7
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: ras:extlog_mem_event event enabled
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: Enabled event ras:extlog_mem_event
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Recording mc_event events
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Recording aer_event events
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Recording extlog_event events
Jan 26 21:48:43 ed-siad-7 rasdaemon[2586]: rasdaemon: Recording mce_record events
Jan 26 21:48:43 ed-siad-7 systemd[1]: Started rasdaemon.service - RAS daemon to log the RAS events.
user@ed-siad-7:~$ 

Figured out the problem: I was injecting the errors incorrectly. The Debian 12 rasdaemon package also has a bug, so I had to pull in the Debian 13 version.
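For anyone hitting the same symptom: the dmesg line in the report shows Bank 9 with a status word of 0000000000000000. In the architectural MCi_STATUS layout, bit 63 (VAL) marks a record as valid, so an all-zero status has no valid error in it. That may be consistent with a bad injection, though the issue does not say exactly what was wrong with the injection. Below is a minimal Python sketch for decoding that line. The function name `decode_mce_line` and the regex are my own illustration, not part of rasdaemon, and the second sample status value is made up to show the flags.

```python
import re

# Architectural MCi_STATUS flag bits (Intel SDM, machine-check architecture).
STATUS_FLAGS = {
    "VAL": 63,    # record is valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register is valid
    "ADDRV": 58,  # MCi_ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# Matches the kernel's "CPU n: Machine Check: <mcgstatus> Bank n: <status>" line.
MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check(?: Exception)?: ([0-9a-fA-F]+) "
    r"Bank (\d+): ([0-9a-fA-F]{16})"
)


def decode_mce_line(line):
    """Hypothetical helper: parse one dmesg MCE line, or return None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    cpu, mcgstatus, bank, status_hex = match.groups()
    status = int(status_hex, 16)
    return {
        "cpu": int(cpu),
        "mcgstatus": int(mcgstatus, 16),
        "bank": int(bank),
        "status": status,
        # Names of the flag bits that are set, highest bit first.
        "flags": [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1],
        # Low 16 bits carry the MCA error code.
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    reported = ("[ 1366.830251] mce: [Hardware Error]: CPU 0: "
                "Machine Check: 0 Bank 9: 0000000000000000")
    print(decode_mce_line(reported))

    # Made-up status value, only to show what a valid record looks like.
    sample = "CPU 1: Machine Check: 0 Bank 4: 8c00000000010090"
    print(decode_mce_line(sample))
```

If the `flags` list comes back empty, as it does for the line in this report, the record has no VAL bit set, so it is worth double-checking the injected status value before looking at the daemon.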