mchehab/rasdaemon

ug! no event found for type 843

jhoblitt opened this issue · 8 comments

Building the rpm from c225517 on CentOS 7 results in the logs being spammed with ug! no event found for type 843. The ug! message is repeated 16468 times in the journal, but there are also journal rate-limit messages, so the true total is probably much higher.

-- Logs begin at Sun 2022-05-15 21:28:23 UTC, end at Mon 2022-05-16 18:20:01 UTC. --
May 16 18:17:20 foo06.example.org systemd[1]: Starting RAS daemon to log the RAS events...
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Page offline choice on Corrected Errors is soft
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Threshold of memory Corrected Errors is 50 / 24h
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: ras:mc_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: ras:mc_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event ras:mc_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: ras:aer_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: mce:mce_record event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: ras:extlog_mem_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: block:block_rq_complete event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: ras:aer_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event ras:aer_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get ras:non_standard_event traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from ras:non_standard_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get ras:arm_event traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from ras:arm_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: mce:mce_record event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event mce:mce_record
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: ras:extlog_mem_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event ras:extlog_mem_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get net:net_dev_xmit_timeout traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get devlink:devlink_health_report traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from devlink:devlink_health_report
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't write to filter file
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get ras:memory_failure_event traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from ras:memory_failure_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Listening to events for cpus 0 to 63
May 16 18:17:22 foo06.example.org systemd[1]: Started RAS daemon to log the RAS events.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording mc_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording aer_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording extlog_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording mce_record events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording non_standard_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording arm_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording devlink_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording disk_errors events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording memory_failure_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: trace-cmd: No such file or directory
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (968) ras:mc_event with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (967) ras:aer_event with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (82) mce:mce_record with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (969) ras:extlog_mem_event with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: Calling ras_mc_event_opendb()
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843
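For anyone wanting to tally these lines per type, here is a rough sketch. It assumes the journal has been dumped to text (for example with journalctl -u rasdaemon) and that the message format matches the lines above. The function name count_ug_messages is made up for illustration. Because of journal rate limiting, whatever it reports is only a lower bound.

```python
import re
import sys
from collections import Counter

# Matches the message as it appears in the journal excerpt above.
UG_RE = re.compile(r"ug! no event found for type (\d+)")

def count_ug_messages(lines):
    """Return a Counter mapping event type number -> number of 'ug!' lines."""
    counts = Counter()
    for line in lines:
        match = UG_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

if __name__ == "__main__":
    # Reads journal text from stdin and prints one line per event type.
    for event_type, n in count_ug_messages(sys.stdin).most_common():
        print(f"type {event_type}: {n}")
```

Usage would be along the lines of piping the journal text into the script on stdin.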
mtds commented

I believe you have encountered a problem similar to the one described in issue #19.

The type number was different, but it basically boiled down to rasdaemon looking for hardware which it is not able to identify and/or properly interact with. The only solution was to rebuild rasdaemon from source, enabling only a subset of its features.

#19 looks extremely similar, including that I'm testing this on an EPYC 7xx2 CPU. I'll try rebuilding with reduced feature flags.

Using the flags from #19 as a starting point, I was able to get a build that doesn't spam the log with the ug! errors. It looks like --enable-diskerror was the culprit.

--- a/misc/rasdaemon.spec.in
+++ b/misc/rasdaemon.spec.in
@@ -39,7 +39,8 @@ an utility for reporting current error counts from the EDAC sysfs files.
 %setup -q
 
 %build
-%configure --enable-all --with-sysconfdefdir=%{_sysconfdir}/sysconfig
+%configure --enable-sqlite3 --enable-aer --enable-non-standard --enable-mce --enable-extlog --enable-devlink \
+--enable-abrt-report --enable-hisi-ns-decode --enable-memory-ce-pfa --enable-memory-failure
 
 make %{?_smp_mflags}

I may have spoken too soon. With the same build, the ug! messages appear again after restarting the daemon.

Even cutting the flags down to only --enable-sqlite3 doesn't stop the log messages.

compile time options summary
============================

    Sqlite3             : yes
    AER                 : no
    MCE                 : no
    EXTLOG              : no
    CPER non-standard   : no
    ABRT report         : no
    HISI Kunpeng errors : no
    ARM events          : no
    DEVLINK             : no
    Disk I/O errors     : no
    Memory Failure      : no
    Memory CE PFA       : no
    AMP RAS errors      : no

It looks like the error messages are coming in a large batch every ~30s.

From looking at the libevent code, it appears that the type number comes from the kernel event message. If so, how does one map the type back to the kernel call site?
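One possible way to look the number up, assuming the type is the tracepoint ID that the kernel exposes under tracefs: each event directory there has an id file (events/<system>/<event>/id). The sketch below scans those files for a given ID. The mount points listed and the function name find_event_by_id are assumptions, and reading tracefs usually requires root.

```python
import sys
from pathlib import Path

# Common tracefs locations; which one exists depends on the kernel and mounts.
CANDIDATE_ROOTS = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")

def find_event_by_id(tracing_root, wanted_id):
    """Return sorted 'system:event' names whose id file contains wanted_id."""
    matches = []
    for id_file in Path(tracing_root).glob("events/*/*/id"):
        try:
            value = int(id_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry, skip it
        if value == wanted_id:
            event_dir = id_file.parent
            matches.append(f"{event_dir.parent.name}:{event_dir.name}")
    return sorted(matches)

if __name__ == "__main__":
    wanted = int(sys.argv[1]) if len(sys.argv) > 1 else 843
    for root in CANDIDATE_ROOTS:
        if Path(root, "events").is_dir():
            print(root, find_event_by_id(root, wanted) or "no match")
            break
```

If nothing matches, the assumption that the number is a tracefs event ID may simply be wrong for this case, or the event may belong to a module that is not currently loaded.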

Basically, libevent is an early version of the code that was later packaged as libtraceevent.
I updated rasdaemon to use libtraceevent after version 0.7.0. This should hopefully help solve issues when decoding events, as that library is maintained together with the kernel code.