ohwgiles/laminar

Possible memory leak - LaminarFixture.Abort

Valicek1 opened this issue · 5 comments

Hi,
I checked out the latest sources, built a Docker container with laminar under Debian, and ran the tests.
The LaminarFixture.Abort test hangs, uses up one CPU core, and eventually fills memory until the kernel OOM killer hits. The other outcome is a core dump.

Step 10/12 : RUN ./laminar-tests
 ---> Running in 6feaf8318534
[==========] Running 19 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from LaminarFixture
[ RUN      ] LaminarFixture.EmptyStatusMessageStructure
[       OK ] LaminarFixture.EmptyStatusMessageStructure (230 ms)
[ RUN      ] LaminarFixture.JobNotifyHomePage
[       OK ] LaminarFixture.JobNotifyHomePage (325 ms)
[ RUN      ] LaminarFixture.OnlyRelevantNotifications
[       OK ] LaminarFixture.OnlyRelevantNotifications (616 ms)
[ RUN      ] LaminarFixture.FailedStatus
[       OK ] LaminarFixture.FailedStatus (290 ms)
[ RUN      ] LaminarFixture.WorkingDirectory
[       OK ] LaminarFixture.WorkingDirectory (350 ms)
[ RUN      ] LaminarFixture.Environment
[       OK ] LaminarFixture.Environment (390 ms)
[ RUN      ] LaminarFixture.ParamsToEnv
[       OK ] LaminarFixture.ParamsToEnv (390 ms)
[ RUN      ] LaminarFixture.Abort
Segmentation fault (core dumped)

As mentioned above, the build runs in Docker, under Debian buster that is updated during the docker build. Machine architecture is amd64.
I have found the same commit that was used for the last successful CI build of laminar, laminar/156.
I'll try to find which commit causes this behavior, and I'll also try building under a different rootfs (maybe bullseye, which is stable now).

After running the tests in a brand-new Docker container based on a recent debian-bullseye, it seems the behavior changed between these two commits:

Fail: 06a5f3d8ef3e0b799ac8a92fd54808b05ac4212f assign run numbers at queue time
Success: 6d2c0b208bb8c273e2da1afb08f4dec5f190dc48 fix LAST_RESULT env var

The OOM is probably because that test uses yes as a job script, so the log grows forever. I don't think that can be considered a leak, but it might be worth implementing some limit on the size of a log. Anyway you could verify that by replacing yes with echo test; sleep inf.
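To make the log-size limit idea concrete, here is a rough sketch of a capped buffer that keeps only the newest output and counts what it drops. This is not laminar code: `CappedLog`, `max_bytes` and `dropped` are hypothetical names made up for illustration, and a real limit in laminar might be better off truncating or failing the run instead.

```python
class CappedLog:
    """Hypothetical bounded log buffer: keeps at most max_bytes of the newest output."""

    def __init__(self, max_bytes):
        if max_bytes <= 0:
            raise ValueError("max_bytes must be positive")
        self.max_bytes = max_bytes
        self._buf = bytearray()
        self.dropped = 0  # total number of bytes discarded so far

    def append(self, chunk):
        # Add the new output, then trim the oldest bytes if we exceed the cap.
        self._buf += chunk
        overflow = len(self._buf) - self.max_bytes
        if overflow > 0:
            del self._buf[:overflow]
            self.dropped += overflow

    def contents(self):
        return bytes(self._buf)


# Simulate a chatty job in the style of `yes`: 1000 lines of "y".
log = CappedLog(max_bytes=16)
for _ in range(1000):
    log.append(b"y\n")
```

With a cap like this, memory use stays bounded no matter how long the job keeps printing, at the cost of losing the start of the log.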

The real issue is why the abort command hangs. I can't easily reproduce that, could you post a minimal script or Dockerfile that does?

Hi, sorry for the late reply. I put together a minimal example Dockerfile that fails for me. See gist. I tried it with a clone of the upstream laminar git repository.

One more thing: after replacing yes with sleep inf or similar, the tests pass. Isn't it possible that reading stdout from yes blocks laminar from receiving the signal to abort the job?

EDIT: yes produces 3.7 Gigabytes of output per second.
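One way to poke at this hypothesis outside laminar is a small standalone script that spawns a child flooding stdout, reads from it in bounded chunks for a short while, and then sends SIGTERM. This is only a sketch under assumptions: the child is a Python stand-in for yes, `run_and_abort`, `deadline_s` and `chunk` are names invented here, and it says nothing about how laminar's own event loop schedules reads versus aborts.

```python
import subprocess
import sys
import time

# Stand-in for `yes`: a child that writes "y" lines as fast as it can.
CHILD = "import sys\nwhile True:\n    sys.stdout.write('y\\n')\n"


def run_and_abort(deadline_s=0.2, chunk=65536):
    """Read the child's output in bounded chunks, then abort it.

    Returns (exit_code, bytes_read).
    """
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    total = 0
    start = time.monotonic()
    # Each read is bounded, so the loop gets back to the deadline check
    # regularly instead of draining the pipe forever.
    while time.monotonic() - start < deadline_s:
        total += len(proc.stdout.read(chunk))
    proc.terminate()  # SIGTERM on POSIX
    proc.stdout.close()
    return proc.wait(), total


exit_code, bytes_read = run_and_abort()
print("child exit code:", exit_code, "bytes read:", bytes_read)
```

If a reader like this only checked for the abort once the pipe was empty, it would never get there with a producer this fast, which is roughly the behavior I am wondering about.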

I think using sleep inf instead of yes is a valid solution to this. If you open a PR I'll merge it.