AER not reported immediately

Question

sanecz opened this issue 2 years ago · 1 comments

Hello,

On recent Linux kernel versions, I have trouble retrieving AER events (I have not tested other error types).
I'm using rasdaemon 0.7.0, but I also had the issue with rasdaemon 0.6.8.

After bisecting the kernel and testing, the commit torvalds/linux@42fb0a1 seems to be the breaking change.

Since the poll/read function has been fixed to work as designed, the ring buffer now needs to be filled to a certain level before poll is notified and returns. As a result, a large number of errors must accumulate before the events are polled.

https://lore.kernel.org/all/20221020231427.41be3f26@gandalf.local.home/T/#md2090ad803d1e4b2fe53bb51c9c78791445ed2ed
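To make the blocking behavior easier to observe outside rasdaemon, here is a minimal sketch of a poll-with-timeout helper. The helper name `wait_readable` and the pipe demo are illustrative only and are not part of rasdaemon. To try it against tracefs, one could open `per_cpu/cpu31/trace_pipe_raw` under the rasdaemon instance directory (assuming that is the file being polled, and running as root) and pass its file descriptor to the helper after each injection.

```python
import os
import select


def wait_readable(fd, timeout_ms):
    """Return True if poll() reports fd readable within timeout_ms."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(revents & select.POLLIN for _fd, revents in events)


def demo_with_pipe():
    """Exercise the helper on an ordinary pipe: empty first, then one byte."""
    read_fd, write_fd = os.pipe()
    try:
        before = wait_readable(read_fd, 100)  # nothing written yet
        os.write(write_fd, b"x")
        after = wait_readable(read_fd, 100)   # one byte is pending
        return before, after
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(demo_with_pipe())
```

An ordinary pipe becomes readable as soon as a single byte is written. The behavior described above differs in that the trace ring buffer reportedly has to reach a fill threshold before poll returns.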

We tried setting buffer_percent and buffer_size_kb in tracefs to the smallest accepted values (1% and 1 KB; as far as I know, buffer_percent is not documented yet), but we still could not retrieve single events.
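For anyone repeating this experiment, the following is a rough sketch of reading and writing those tracefs files from a script. The function names are hypothetical. Only the first token of each file is kept, on the assumption that a file may carry extra annotation after the number. Writing to the real instance directory requires root.

```python
from pathlib import Path

# File names as they appear under the tracefs instance directory.
TUNABLES = ("buffer_percent", "buffer_size_kb", "buffer_total_size_kb")


def read_tunables(instance_dir):
    """Return the first token of each tunable file that exists."""
    values = {}
    for name in TUNABLES:
        path = Path(instance_dir) / name
        if path.exists():
            tokens = path.read_text().split()
            if tokens:
                values[name] = tokens[0]
    return values


def set_tunable(instance_dir, name, value):
    """Write a new value to one tunable file (needs root on real tracefs)."""
    (Path(instance_dir) / name).write_text(f"{value}\n")


if __name__ == "__main__":
    instance = "/sys/kernel/debug/tracing/instances/rasdaemon"
    print(read_tunables(instance))
```

Calling `set_tunable(instance, "buffer_percent", 1)` followed by `read_tunables(instance)` would reproduce the settings checked in the terminal output in this report.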

This behavior has been reproduced on kernels v5.15.82 and v6.2-rc1.

Reproduction of the issue using aer-inject:
I ran ./aer-inject -s xx:xx.x examples/nonfatal 73 times. Only after the 73rd AER was sent did rasdaemon read the 72 previous ones.
In my case, all of the AERs were on CPU 31.

# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent 
1
# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_size_kb 
1
# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_total_size_kb
48
-- after injecting 72 entries, still not polled by rasdaemon
# cat /sys/kernel/debug/tracing/instances/rasdaemon/per_cpu/cpu31/stats 
entries: 72
overrun: 0
commit overrun: 0
bytes: 4032
oldest event ts: 8828234
now ts: 8832310
dropped events: 0
read events: 0
-- here I sent the 73rd AER injection
# cat /sys/kernel/debug/tracing/instances/rasdaemon/per_cpu/cpu31/stats 
entries: 1
overrun: 0
commit overrun: 0
bytes: 56
oldest event ts: 8832465
now ts: 8832556
dropped events: 0
read events: 72
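The two stats snapshots above can be checked with a small parsing sketch (the function names are hypothetical). It shows that each AER event takes 56 bytes in the ring buffer, so the 72 pending events occupy 4032 bytes. That is just under 4096 bytes, which would be consistent with the reader being woken only once a buffer page fills, assuming a 4 KiB page. The page size is an assumption and is not shown in the output.

```python
def parse_stats(text):
    """Parse 'key: value' lines from a per-CPU stats file into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def bytes_per_entry(stats):
    """Average size of one buffered event, or 0.0 if the buffer is empty."""
    entries = int(stats["entries"])
    return int(stats["bytes"]) / entries if entries else 0.0


# Values copied from the two snapshots in this report.
BEFORE = "entries: 72\nbytes: 4032\nread events: 0\n"
AFTER = "entries: 1\nbytes: 56\nread events: 72\n"

if __name__ == "__main__":
    print(bytes_per_entry(parse_stats(BEFORE)))  # 56.0
    print(bytes_per_entry(parse_stats(AFTER)))   # 56.0
```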

Added the strace here: strace.txt

Is there something to do on the rasdaemon side, or do we need to report this to the kernel tracing mailing list? If so, how should we proceed?

Answer 1 · 2023-02-06T09:56:32.000Z

It seems the issue has been identified and an RFC has been proposed, so I'm closing this issue.
Follow-up:
https://lore.kernel.org/linux-edac/20230202161831.6a4fca2a@rorschach.local.home/
https://lore.kernel.org/linux-edac/20230202182352.792-1-shiju.jose@huawei.com/T/#u