AER not reported immediately
sanecz opened this issue · 1 comments
Hello,
With recent versions of Linux, I have trouble retrieving AER events (I have not tested other error types).
I'm using rasdaemon 0.7.0, but I also had the issue with rasdaemon 0.6.8.
After bisecting the kernel and testing, the commit torvalds/linux@42fb0a1 seems to be the breaking change.
Since the poll/read function has been fixed to work as designed, the ring buffer now needs to be filled to a certain level before poll is notified and returns. As a result, a large number of errors is required before any events are polled.
We tried setting buffer_percent and buffer_size_kb in tracefs to the smallest accepted values (1% and 1 KB; as far as I know, buffer_percent is not documented at the moment), but we were still unable to retrieve single events.
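The two settings mentioned above can be applied roughly as follows. This is only a minimal sketch: the helper name `set_trace_watermark` is hypothetical, the instance directory is assumed to be the rasdaemon one shown in the transcript further down, and writing to tracefs requires root.

```shell
# Hypothetical helper: write the wakeup watermark and the per-CPU buffer
# size into a tracefs instance directory (needs root on a real system).
set_trace_watermark() {
    instance_dir=$1   # e.g. /sys/kernel/debug/tracing/instances/rasdaemon
    percent=$2        # value written to buffer_percent
    size_kb=$3        # value written to buffer_size_kb

    printf '%s\n' "$percent" > "$instance_dir/buffer_percent" || return 1
    printf '%s\n' "$size_kb" > "$instance_dir/buffer_size_kb" || return 1

    # Read the values back, since the kernel may adjust what was written.
    printf 'buffer_percent=%s buffer_size_kb=%s\n' \
        "$(cat "$instance_dir/buffer_percent")" \
        "$(cat "$instance_dir/buffer_size_kb")"
}
```

For example: `set_trace_watermark /sys/kernel/debug/tracing/instances/rasdaemon 1 1`.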
This behavior has been reproduced on kernels v5.15.82 and v6.2-rc1.
Reproduction of the issue using aer-inject:
I ran ./aer-inject -s xx:xx.x examples/nonfatal 73 times. Only once the 73rd AER had been sent were the other 72 AERs read by rasdaemon.
In my case, all of the AERs were on CPU 31.
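The repeated injection can be scripted along these lines. This is a sketch only: `inject_repeat` is a hypothetical helper, and `xx:xx.x` remains the placeholder used in this report for the PCI device address.

```shell
# Hypothetical helper: run a command a fixed number of times and stop at
# the first failure. The command and its arguments are passed unchanged.
inject_repeat() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" || return 1
        i=$((i + 1))
    done
}
```

For example: `inject_repeat 73 ./aer-inject -s xx:xx.x examples/nonfatal`. Checking the per-CPU stats file between runs shows whether rasdaemon has read anything yet.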
# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent
1
# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_size_kb
1
# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_total_size_kb
48
-- after injecting 72 entries, still not polled by rasdaemon
# cat /sys/kernel/debug/tracing/instances/rasdaemon/per_cpu/cpu31/stats
entries: 72
overrun: 0
commit overrun: 0
bytes: 4032
oldest event ts: 8828234
now ts: 8832310
dropped events: 0
read events: 0
-- here I sent the 73rd AER injection
# cat /sys/kernel/debug/tracing/instances/rasdaemon/per_cpu/cpu31/stats
entries: 1
overrun: 0
commit overrun: 0
bytes: 56
oldest event ts: 8832465
now ts: 8832556
dropped events: 0
read events: 72
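The two stats dumps above are consistent with each other: 4032 bytes / 72 entries = 56 bytes per entry, which matches the 56 bytes reported for the single remaining entry. If the ring buffer uses 4096-byte pages (an assumption, not something confirmed in this report), 72 such entries nearly fill one page, which would fit with the reader only being woken once a page is complete. A small sketch to compute the per-entry size from a stats file (the helper name `bytes_per_entry` is hypothetical):

```shell
# Hypothetical helper: print the average bytes per entry from a per-CPU
# "stats" file such as per_cpu/cpu31/stats. Prints 0 for an empty buffer.
bytes_per_entry() {
    awk -F': *' '
        $1 == "entries" { entries = $2 }
        $1 == "bytes"   { bytes = $2 }
        END {
            if (entries > 0) print bytes / entries
            else print 0
        }
    ' "$1"
}
```

For example: `bytes_per_entry /sys/kernel/debug/tracing/instances/rasdaemon/per_cpu/cpu31/stats`.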
Added the strace here: strace.txt
Is there something to do on the rasdaemon side, or do we need to report this on the kernel trace mailing list? If so, how should we proceed?
It seems the issue has been identified and an RFC has been proposed, so I'm closing the issue.
Follow up:
https://lore.kernel.org/linux-edac/20230202161831.6a4fca2a@rorschach.local.home/
https://lore.kernel.org/linux-edac/20230202182352.792-1-shiju.jose@huawei.com/T/#u