test_informer.rb fails sporadically

Question

cben opened this issue 2 years ago · 8 comments

I'm seeing various sporadic failures on test_informer.rb, or sometimes it gets stuck until timeout kills it. Examples:

https://github.com/ManageIQ/kubeclient/actions/runs/3538717648/jobs/5939824652 (ruby 3.0.4)

RetryTest#test_timeout [/Users/runner/work/kubeclient/kubeclient/test/test_informer.rb:129]:
not all expectations were satisfied
unsatisfied expectations:
- expected exactly once, invoked never: #<AnyInstance:Kubeclient::Common::WatchStream>.finish(any_parameters)

https://github.com/ManageIQ/kubeclient/actions/runs/3359829904/jobs/5568290089 (ruby 2.5 got stuck and was killed by timeout; note the 3m 0s run time)
https://github.com/ManageIQ/kubeclient/actions/runs/3540149961/jobs/5942864716 (ruby 3.1.2 timeout)

seen locally after running in a loop (ruby 2.7.5):

RetryTest#test_can_watch_watches [/home/beni/kubeclient/test/test_informer.rb:119]:
The request GET /\/v1\/watch\/pods/ was expected to execute 1 time but it executed 2 times

@grosser would you like to investigate?
I haven't looked inside, no idea if just flaky test, or actual bug...

grosser commented 2 years ago

oh no

Answer 1 · 2022-11-24T16:36:08.000Z

not using this library, so no thanks :D

Answer 2 · 2022-11-24T16:40:34.000Z

ahh it's the actual kubeclient ... did this repo get renamed ?
... I thought this was something else 🤦
I'll take a look ...

Answer 3 · 2022-11-24T18:06:57.000Z

thx for the nice writeup, good to have the actual backtraces and to know it's not a single issue but multiple places

ran it 100 times on 3.0 locally and no failure
ran it 100 times on 2.7.6 and no failure

the "expected 1 got 2 watch" error would mean that the watch crashed and restarted :/

#586

... maybe you get it to fail again locally 🤞
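
To illustrate how a crash-and-restart would turn one expected watch request into two, here is a minimal sketch. It is not kubeclient's informer code: run_watch and its max_attempts parameter are hypothetical, and the block only stands in for issuing the watch request and processing the stream.

```ruby
# Hypothetical watch loop (not kubeclient's implementation): it re-issues
# the watch request whenever processing the stream raises an error.
def run_watch(max_attempts: 3)
  requests = 0
  begin
    requests += 1
    yield requests # stands in for "GET /v1/watch/pods" plus stream processing
  rescue StandardError
    retry if requests < max_attempts
    raise
  end
  requests
end

# A clean run issues a single request.
clean = run_watch { |_attempt| :ok }

# A single crash in the stream causes a second request, which is what a
# request-count expectation in a test would then report.
crashed_once = run_watch { |attempt| raise "stream error" if attempt == 1 }
```

Under these assumptions, any exception raised while the first watch is running would be enough to produce the second request, even if the restart itself works as intended.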

Answer 4 · 2022-11-27T09:16:26.000Z

(Yes, repo was moved under manageiq org when Alissa @abonas was leaving Red Hat so we don't depend on her for future maintainer handoffs, and in hope Adam @agrare would join or at least be backup maintainer as I'm having less and less time for it.)

Answer 5 · 2022-11-27T09:16:37.000Z

BTW I notice with_worker does this: "sleep(0.03) # wait for worker to watch", which (at least in theory) is not guaranteed. And one test does "sleep(0.02) # wait for watch to finish". Generally all uses of sleep() are suspect.
But I haven't dug into the logic to say if any sleep race conditions are plausible explanations for any actual failure modes...
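
One way to avoid depending on sleep() timing is to have the worker signal explicitly when it has started. The sketch below is only an illustration: FakeWorker, wait_until_watching and the queues are hypothetical names, not part of kubeclient or its test helpers.

```ruby
require "timeout"

# Hypothetical stand-in for a worker that watches in a background thread.
# It is not kubeclient's Informer; all names are made up for illustration.
class FakeWorker
  def initialize
    @started = Queue.new # the worker pushes here once it is watching
    @events = Queue.new  # events (or :stop) delivered to the worker
  end

  def start
    @thread = Thread.new do
      @started << :watching # announce readiness instead of relying on timing
      loop do
        event = @events.pop
        break if event == :stop
        # a real worker would handle the watch event here
      end
    end
  end

  # Block until the worker reports that it is watching, or raise
  # Timeout::Error if that does not happen within the given time.
  def wait_until_watching(timeout: 1)
    Timeout.timeout(timeout) { @started.pop }
  end

  def stop
    @events << :stop
    @thread.join
  end
end

worker = FakeWorker.new
worker.start
state = worker.wait_until_watching # deterministic, unlike sleep(0.03)
worker.stop
```

With a handshake like this, the test either proceeds as soon as the worker is ready or fails with a clear timeout error, rather than racing against a fixed delay.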

Answer 6 · 2022-11-27T22:31:46.000Z

maybe #587 fixes this ...

Answer 7 · 2023-01-05T14:28:52.000Z

Also, a few race conditions that could cause the flakiness will be fixed in #597