rasdaemon leaking memory on nodes with high volumes of errors

Question

rasdaemon leaking memory on nodes with high volumes of errors

frebib opened this issue 4 years ago · 7 comments

Over the space of several weeks, we observe the memory usage of rasdaemon growing almost linearly, seemingly infinitely.
Initially we thought this was related to this fd leak (4bf0b71), which we were also seeing. We backported 0.6.6 from Debian Bullseye but are still seeing issues.

jgroocock@47m54:~$ sudo rasdaemon -V
rasdaemon 0.6.6
jgroocock@47m54:~$ sudo ras-mc-ctl --summary
Memory controller events summary:
	Corrected on DIMM Label(s): 'CPU1_E0' location: 3:1:0:-1 errors: 1987314

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.

There doesn't appear to be anything in the way of errors in the logs. Just event logs and nothing else

Jan 07 17:12:50 47m54 rasdaemon[3546146]: rasdaemon: mc_event store: 0x559881b894f8
Jan 07 17:12:50 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:50 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mc_event:             2021-01-07 17:12:47 +0000 6 Corrected errors: memory read error on CPU1_E0 (mc: 3 location: 1:0 address: 0x20fa015f00 grain: 5  OVERFLOW err_code:0x0101:0x0091 socket:1 imc:1 rank:0 bg:0 ba:0 row:0xf40 col:0x370)
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: mce_record store: 0x559881ba6968
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:51 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mce_record:           2021-01-07 17:12:48 +0000 bank=8, status= dc00008001010091, MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error MscodDataRdErr, mci=Error_overflow Corrected_error Error_enabled, mca=M2M: , n_errors=2, cpu_type= Skylake server, cpu= 24, socketid= 1, misc= 200400c028201086, addr= 21730d5f00, mcgstatus=0, mcgcap= f000c14, apicid= 40
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: mc_event store: 0x559881b894f8
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:51 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mc_event:             2021-01-07 17:12:48 +0000 2 Corrected errors: memory read error on CPU1_E0 (mc: 3 location: 1:0 address: 0x21730d5f00 grain: 5  OVERFLOW err_code:0x0101:0x0091 socket:1 imc:1 rank:0 bg:0 ba:0 row:0xe76 col:0x370)

Over thousands of nodes, we see ~65 at present using >1GiB of memory after 3 weeks. Some after this time are using over 16GiB and climbing. Here is a graph over the last 24 days

Generally though, rasdaemon doesn't seem to use much memory at all. The 50th percentile is a mere ~1MiB, 80th being ~1.4MiB and even the 99th is only a few hundred MiB.

One thing that seems apparent is the common factor amongst the nodes with high memory usage have a seriously high volume of errors, such as the output above suggests. We suspect this is due to faulty hardware.

Is there any information I can collect/provide that would help to narrow down what is holding onto the memory?
Thanks

Answer 1 · 2021-01-08T01:30:39.000Z

Can you try with this fix? #34 It resolved one leak we were seeing.

We were able to pretty easily reproduce the problem using mce-inject (https://github.com/andikleen/mce-inject) and then doing something like the following and letting it run for a few minutes:
# while :; do ./mce-inject test/corrected ; sleep .1; done
If you watch RSS you should see it steadily climb without the patch.
For ex:
pidstat -p $(pidof rasdaemon) -r 1

Answer 2 · 2021-01-27T14:51:12.000Z

Thanks @joshuahunt I'll give that patch a try and let you know how we get on

Answer 3 · 2021-01-28T16:12:08.000Z

That PR significantly improves the situation. It's not been running for long, but it looks like the trend is much more gradual

Thanks!

Answer 4 · 2021-01-28T16:38:13.000Z

Thanks for testing @frebib. Could you let me know if the memory usage levels off after a period of time or if it still continues to grow?

Answer 5 · 2021-01-28T17:44:14.000Z

Sure, I can post an update in a few days. I'm going to deploy this everywhere tomorrow so we'll see how it performs. It looks like the memory might still be increasing ever so slightly at the moment, but it's hard to tell

Answer 6 · 2021-02-09T04:12:30.000Z

@frebib How's the memory looking now?

Answer 7 · 2021-02-09T14:10:27.000Z

Considerably better, thanks! These graphs are a 2 week view. The drop is when the patch was deployed

The p99 is way more than acceptable now

Even the worst-case seems fine, although I'm not sure how much I believe this graph because I do see the very occasional OOM