apache/openwhisk-runtime-go

Travis Test failing randomly: Example_shell_nocompiler missing Goodbye!

csantanapr opened this issue · 8 comments

Today May 19th Travis build failed: https://travis-ci.org/apache/incubator-openwhisk-runtime-go/builds/381051039#L1380

=== RUN   Example_shell_nocompiler
2018/05/19 11:53:01 /tmp/action839309642
2018/05/19 11:53:01 http://127.0.0.1:40685
2018/05/19 11:53:01 binary
2018/05/19 11:53:01 starting /tmp/action839309642/1/main
2018/05/19 11:53:01 run: start
2018/05/19 11:53:01 service: start
2018/05/19 11:53:01 logger: start
2018/05/19 11:53:01 >>>{"name":"Mike"}
2018/05/19 11:53:01 <<<{"hello": "Mike"}
2018/05/19 11:53:01 {"hello": "Mike"}
2018/05/19 11:53:01 >>>{"name":"*"}
2018/05/19 11:53:01 run: end
--- FAIL: Example_shell_nocompiler (1.03s)
got:
400 {"error":"no action defined yet"}
200 {"ok":true}
200 {"hello": "Mike"}
400 {"error":"command exited"}
400 {"error":"no action defined yet"}
msg=hello Mike
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
want:
400 {"error":"no action defined yet"}
200 {"ok":true}
200 {"hello": "Mike"}
400 {"error":"command exited"}
400 {"error":"no action defined yet"}
msg=hello Mike
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
Goodbye!
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX

@sciabarracom Could you look into this failure?

Yes I think also this is related to the timeouts. There are a few timeouts in the code (2 or 3) and I did not consider under the CI execution can be MUUUUCH slower. I am going to review all of them and parametrize timings for tests.

@sciabarra This test still fails even after the timeout changes
https://travis-ci.org/apache/incubator-openwhisk-runtime-go/builds/381940363#L1361

I think this one is related to the bash script exiting and the stderr content not being flushed.

=== RUN   Example_shell_nocompiler
2018/05/22 01:06:21 /tmp/action205414389
2018/05/22 01:06:21 http://127.0.0.1:38717
2018/05/22 01:06:21 binary
2018/05/22 01:06:21 starting /tmp/action205414389/1/main
2018/05/22 01:06:21 run: start
2018/05/22 01:06:22 service: start
2018/05/22 01:06:22 logger: start
2018/05/22 01:06:22 >>>{"name":"Mike"}
2018/05/22 01:06:22 <<<{"hello": "Mike"}
2018/05/22 01:06:22 {"hello": "Mike"}
2018/05/22 01:06:22 >>>{"name":"*"}
2018/05/22 01:06:22 run: end
--- FAIL: Example_shell_nocompiler (2.02s)
got:
400 {"error":"no action defined yet"}
200 {"ok":true}
200 {"hello": "Mike"}
400 {"error":"command exited"}
400 {"error":"no action defined yet"}
msg=hello Mike
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
want:
400 {"error":"no action defined yet"}
200 {"ok":true}
200 {"hello": "Mike"}
400 {"error":"command exited"}
400 {"error":"no action defined yet"}
msg=hello Mike
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
Goodbye!
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX

I don't see logger: end in the test run. even when test pass locally

Since Go multithreading is cooperative, we have to give an opportunit for the other goroutine to run. And log draining is done by another goroutine. Unfortunately it is not deterministic which goroutine runs. Also since test "ends" it may detect a failure just because there was not enough output collected yet. However, I moved the runtime.Gosched() inside the loop to give an opportunity to run for each line of output isteand of once, and this fixed a lot of issues on OSX. I think however we need also a runtime.Gosched() at the beginning of the loop (where it was originally). Furthermore, there are 2 timeouts now, https://github.com/apache/incubator-openwhisk-runtime-go/blob/master/openwhisk/executor.go#L34 and in the TestMain I only changed one, but there is also a timeout waiting for the log draining. Last but not least there is a timeout after the tests. Will try to increase those timeouts too.

I increased ALL the timeouts in PR#38. Inclusing notably timeouts for draining log, in some tests and for whole test. The issue is simply that the stderror is not drained before the test end. Now I wait 2 seconds an entire test to end, and 100ms (instead of 5) for reading the stderr. Should be enough?

Test is passing now in Travis thanks !