The newest `getrusage` test failed once on Azure Pipelines
vvaltchev opened this issue · 5 comments
Log:
```
[devshell] [RUN ] getrusage
[devshell] ASSERT '++n < max' FAILED.
On: /__w/1/s/tests/system/test_misc.c:467
ERRNO: Invalid argument
[devshell] The command FAILED with 1
[devshell] [FAILED] getrusage (100 ms)
```
Hei Vlad! I think I understand what is going on.
The test works by calling `getrusage` in a loop (after doing some busy work before each call) until it detects an increase in the reported time. The test is designed to fail on timeout, since at that point it's assumed the time will never increase. I also added an attempt limit, but the idea was that it would be so high that the timeout would always trigger first. Apparently Azure manages to do all that work and reach the limit in less time than the system's timer tick period (4 ms), resulting in a failure. Based on the timestamps in the logs, it takes only 40us to reach the limit.
I think the solution is to either:
- increase the attempt limit
- remove the limit entirely
- remove this test entirely since it's obviously overengineered :-|
I guess that at least for the moment, it makes sense to increase the attempt limit.
Note that in both loops, `n` is incremented twice per iteration:

```c
for (n = 0; !comes_after(buf_1.ru_utime, buf_0.ru_utime); n++) { // <<< ONCE
   busy_wait();
   rc = getrusage(RUSAGE_SELF, &buf_1);
   DEVSHELL_CMD_ASSERT(rc == 0);
   DEVSHELL_CMD_ASSERT(++n < max); // <<< TWICE
};
```

Isn't that wrong? I guess we don't need the second `++n` in the assert. Or am I missing something?
If we remove that second `++n` and bump the max to 20k, maybe we'd be fine? It's pretty rare that a test fails because we were *too fast* on Azure.
Yes, I only noticed it today (quite embarrassing, honestly x_x). That, and also the awkward `;` after the for loop.
To keep the CPU busy for more than 4 ms, the limit should be increased at least 10x, and that's for a timer frequency of 250 Hz. If I'm not mistaken, it can go as low as 50 Hz, which would require the limit to cover more than 20 ms.
Isn't it a better idea to remove the limit? The timeout will catch any errors anyways. Originally I only added the limit because I didn't feel comfortable committing an unbounded loop in my first pull request. It should be fine without it, no?
Well... removing the limit would mean timing out only after a lot of time, when all the devshell tests time out. Each devshell test does have its own timeout, but that is enforced only when running in sequential mode with the run_all_tests runner; I believe that in most cases the `-c` option is used to speed things up. In that case, the runner just runs `devshell -c runall` with a fairly big overall timeout. Of course, the devshell runall runner could use a non-blocking waitpid(), check how much time has passed, and kill the test if it took too long, but that would require more code. You can do that if you really want, but for the immediate fix of this issue let's make the limit much bigger and keep it for now, with a rough calculation for the max value. For example, 50k iterations, and remove the extra `++n`, which effectively halves the iteration limit.
If that still fails from time to time, I'll deal with it, don't worry. I just wanted you to take a look first and tell me what you think about it.
On it!