issuu/ocaml-zmq

Zmq_test.test_proxy occasionally causes an exception

andrewray opened this issue · 6 comments

I got the following exception while running the test suite (5.0 release).

Thread 1 killed on uncaught exception Zmq.ZMQ_exception(2, "Context was terminated")
Raised at file "zmq.ml", line 738, characters 2-11
Called from file "zmq.ml", line 552, characters 14-41
Called from file "zmq.ml", line 552, characters 14-41
Called from file "zmq_test.ml", line 141, characters 6-31
Called from file "thread.ml", line 39, characters 8-14

This is being raised by Zmq.Proxy.create. It only happens very rarely.

I wonder if there is some race condition between the OCaml thread and Zmq.Context.terminate. I plan to see how repeatable I can make the error, and whether a Thread.join before terminate makes any difference (or works at all!).
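
The suspected ordering problem can be modelled without ZeroMQ at all. The sketch below is only a toy: `ctx`, `use`, `terminate` and `run_joined` are hypothetical stand-ins invented for illustration, not the ocaml-zmq API, and it assumes the standard `threads` library is linked. It shows that joining the worker before tearing down the shared resource means the worker can never observe the terminated state.

```ocaml
(* Toy model only: [ctx], [use] and [terminate] are hypothetical stand-ins
   for a ZeroMQ context, not the ocaml-zmq API. *)
exception Terminated

type ctx = { mutable alive : bool; lock : Mutex.t }

let create () = { alive = true; lock = Mutex.create () }

(* Run [f] while holding the context lock, releasing it on any exit. *)
let with_lock c f =
  Mutex.lock c.lock;
  match f () with
  | v -> Mutex.unlock c.lock; v
  | exception e -> Mutex.unlock c.lock; raise e

(* Any operation on the context fails once it has been terminated. *)
let use c = with_lock c (fun () -> if not c.alive then raise Terminated)
let terminate c = with_lock c (fun () -> c.alive <- false)

(* The worker touches the context repeatedly and records whether any call
   found it already terminated. *)
let worker (c, saw_terminated) =
  try
    for _i = 1 to 1000 do
      use c;
      Thread.yield ()
    done
  with Terminated -> saw_terminated := true

(* Teardown in the order under discussion: join the worker, then terminate. *)
let run_joined () =
  let c = create () in
  let saw_terminated = ref false in
  let t = Thread.create worker (c, saw_terminated) in
  Thread.join t;
  terminate c;
  !saw_terminated
```

With the join first, `run_joined ()` returns `false`. If `terminate` were called before `Thread.join`, whether the worker sees `Terminated` would depend on scheduling, which would be consistent with a failure that shows up only rarely.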

Centos 7.0, ZMQ 4.1.4

Small update; I took out the proxy test, ran it in a loop and got the following faults, but not the one above.

Bad file descriptor (src/signaler.cpp:282)
Segmentation fault
Assertion failed: pfd.revents & POLLIN (src/signaler.cpp:239)

Thanks for the report.
I'm unable to reproduce the error.
Is it the async or lwt version of the test suite (or both) that fails?
Please also list the OCaml version and the version of the relevant lib (async / lwt).

Could you also describe how you run the tests?
Can you reproduce with a more recent version of libzmq? (e.g. 4.2.5).

The second failure seems to be related to zeromq/libzmq#1307.

The first failure does seem to indicate that we destroy the context before the proxy thread gets a chance to exit, so the test needs to wait for it. Does the following patch fix the proxy test?

diff --git a/zmq/test/zmq_test.ml b/zmq/test/zmq_test.ml
index 9337e6e..1c21df8 100644
--- a/zmq/test/zmq_test.ml
+++ b/zmq/test/zmq_test.ml
@@ -144,7 +144,7 @@ let test_proxy () =
       Unix.Unix_error (Unix.ENOTSOCK, _, _) -> ()
   in
 
-  let _thread = Thread.create proxy (pull, pub) in
+  let proxy_thread = Thread.create proxy (pull, pub) in
   sleep 10;
   let sub =
     let s = Zmq.Socket.create ctx sub in
@@ -169,6 +169,7 @@ let test_proxy () =
   Zmq.Socket.close push;
   Zmq.Socket.close pull;
   Zmq.Socket.close pub;
+  Thread.join proxy_thread;
   Zmq.Context.terminate ctx;
   ()

The test was neither Async nor Lwt - though FYI we are building for Async. OCaml 4.06.

The test is being run in our CI system - so it is run a lot (though somewhat non-deterministically as there will be dozens of other tests running simultaneously), and is generally stable [1]. We have only seen this failure once, and I don't expect to see it again for a while.

I had hoped that running it aggressively in a loop would make it somewhat repeatable, but these other issues seem to stop that being possible.

I think the fix you propose makes total sense. I think I will apply it here, and then wait and see.

[1] We had one other issue with port numbers: a port assigned in this test occasionally matched one used by another test (again very rare, and dependent on which other tests had run by that point). I have a couple of small fixes for that which I could provide.
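
One common way to avoid hard-coded ports in tests is to let the kernel pick one. The following is a minimal sketch using only the standard `Unix` library; `free_port` and `endpoint` are hypothetical helper names, and this is not necessarily the fix referred to above.

```ocaml
(* Ask the kernel for a currently unused TCP port on the loopback
   interface by binding to port 0 and reading the bound address back. *)
let free_port () =
  let s = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
  let addr =
    try
      Unix.bind s (Unix.ADDR_INET (Unix.inet_addr_loopback, 0));
      Unix.getsockname s
    with e -> Unix.close s; raise e
  in
  (* Release the probe socket so the test can bind the port itself. *)
  Unix.close s;
  match addr with
  | Unix.ADDR_INET (_, port) -> port
  | Unix.ADDR_UNIX _ -> failwith "free_port: unexpected address family"

(* Build a ZeroMQ-style TCP endpoint string on the chosen port. *)
let endpoint () = Printf.sprintf "tcp://127.0.0.1:%d" (free_port ())
```

This narrows the collision window but does not close it, since another process could still claim the port between the `Unix.close` and the later bind. libzmq itself accepts a wildcard port (`tcp://127.0.0.1:*`) and reports the chosen endpoint through the `ZMQ_LAST_ENDPOINT` option, which avoids that gap where the binding exposes it.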

A PR to fix the tests is more than welcome.

Closing due to lack of activity.
Please do reopen (or better yet, submit a PR) if the problem still persists.