mchehab/rasdaemon

rasdaemon leaking memory on nodes with high volumes of errors

frebib opened this issue · 7 comments

Over the space of several weeks, we have observed the memory usage of rasdaemon growing almost linearly, seemingly without bound.
Initially we thought this was related to this fd leak (4bf0b71), which we were also seeing. We backported 0.6.6 from Debian Bullseye, but we are still seeing the problem.
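For what it's worth, the open fd count can be spot-checked straight from procfs on a live node to rule that fd leak back in or out; this is just a quick sanity check, nothing rigorous:

ls /proc/$(pidof rasdaemon)/fd | wc -l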

jgroocock@47m54:~$ sudo rasdaemon -V
rasdaemon 0.6.6
jgroocock@47m54:~$ sudo ras-mc-ctl --summary
Memory controller events summary:
	Corrected on DIMM Label(s): 'CPU1_E0' location: 3:1:0:-1 errors: 1987314

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.

There doesn't appear to be anything in the way of errors in the logs, just event records and nothing else:

Jan 07 17:12:50 47m54 rasdaemon[3546146]: rasdaemon: mc_event store: 0x559881b894f8
Jan 07 17:12:50 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:50 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mc_event:             2021-01-07 17:12:47 +0000 6 Corrected errors: memory read error on CPU1_E0 (mc: 3 location: 1:0 address: 0x20fa015f00 grain: 5  OVERFLOW err_code:0x0101:0x0091 socket:1 imc:1 rank:0 bg:0 ba:0 row:0xf40 col:0x370)
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: mce_record store: 0x559881ba6968
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:51 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mce_record:           2021-01-07 17:12:48 +0000 bank=8, status= dc00008001010091, MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error MscodDataRdErr, mci=Error_overflow Corrected_error Error_enabled, mca=M2M: , n_errors=2, cpu_type= Skylake server, cpu= 24, socketid= 1, misc= 200400c028201086, addr= 21730d5f00, mcgstatus=0, mcgcap= f000c14, apicid= 40
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: mc_event store: 0x559881b894f8
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:51 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mc_event:             2021-01-07 17:12:48 +0000 2 Corrected errors: memory read error on CPU1_E0 (mc: 3 location: 1:0 address: 0x21730d5f00 grain: 5  OVERFLOW err_code:0x0101:0x0091 socket:1 imc:1 rank:0 bg:0 ba:0 row:0xe76 col:0x370)

Over thousands of nodes, we currently see ~65 using >1GiB of memory after 3 weeks. Some are using over 16GiB after the same period and still climbing. Here is a graph covering the last 24 days:
[graph: rasdaemon memory usage across the fleet over the last 24 days]
Generally, though, rasdaemon doesn't seem to use much memory at all: the 50th percentile is a mere ~1MiB, the 80th is ~1.4MiB, and even the 99th is only a few hundred MiB.

One thing that does seem apparent is that the common factor amongst the nodes with high memory usage is a seriously high volume of errors, as the output above suggests. We suspect this is due to faulty hardware.

Is there any information I can collect/provide that would help to narrow down what is holding onto the memory?
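In case it's useful, one rough way we could snapshot what the daemon is holding on an affected node (just a sketch; it assumes the packaged /usr/sbin/rasdaemon and that temporarily running a foreground copy under a heap profiler, using the usual -f foreground and -r record flags, is acceptable on that node):

grep -E 'VmRSS|VmData' /proc/$(pidof rasdaemon)/status
valgrind --tool=massif /usr/sbin/rasdaemon -f -r

The first command gives a quick resident/heap snapshot; massif would then write a massif.out.<pid> file showing which allocation sites the memory is attributed to.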
Thanks

Can you try with this fix: #34? It resolved one leak we were seeing.

We were able to reproduce the problem pretty easily using mce-inject (https://github.com/andikleen/mce-inject) by doing something like the following and letting it run for a few minutes:
# while :; do ./mce-inject test/corrected ; sleep .1; done
If you watch the RSS, you should see it steadily climb without the patch. For example:
pidstat -p $(pidof rasdaemon) -r 1
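If it's useful for graphing, here's a rough sketch of logging the RSS once a second alongside the injection loop (it assumes the usual VmRSS line in /proc/<pid>/status, reported in kB):

# record timestamped RSS samples while mce-inject runs in another shell
while :; do
    printf '%s %s\n' "$(date +%s)" "$(awk '/VmRSS/ {print $2}' /proc/$(pidof rasdaemon)/status)"
    sleep 1
done > rasdaemon-rss.log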

Thanks @joshuahunt, I'll give that patch a try and let you know how we get on.

That PR significantly improves the situation. It hasn't been running for long, but it looks like the trend is much more gradual:
[graph: rasdaemon memory usage trend after applying the patch]

Thanks!

Thanks for testing @frebib. Could you let me know if the memory usage levels off after a period of time or if it still continues to grow?

Sure, I can post an update in a few days. I'm going to deploy this everywhere tomorrow, so we'll see how it performs. It looks like the memory might still be increasing ever so slightly at the moment, but it's hard to tell.

@frebib How's the memory looking now?

Considerably better, thanks! These graphs are a two-week view; the drop is when the patch was deployed.

The p99 is way more than acceptable now:
[graph: 99th-percentile rasdaemon memory usage, two-week view]

Even the worst case seems fine, although I'm not sure how much I believe this graph because I do see the very occasional OOM:
[graph: worst-case rasdaemon memory usage, two-week view]