awakesecurity/grpc-mqtt

Non-deterministic test failures

Closed this issue · 5 comments

On main, running cabal test I've seen a couple different failures.

Here are the results of 3 consecutive cabal tests. A fourth run passed all tests.

Running 1 test suites...
Test suite test: RUNNING...
Tests: grpc-mqtt
    <...>
    Test.Service.ClientStream
      Test.Service.ClientStream.Unbatched: FAIL (1.52s)
        test/Test/Service.hs:431:
        expected: Just (OneInt {oneIntResult = 15})
         but got: Just (OneInt {oneIntResult = 13})
      Batched:                             SKIP
    Test.Service.ServerStream
      Test.Service.ServerStream.Unbatched: SKIP
      Test.Service.ServerStream.Batched:   SKIP
    Test.Service.BiDiStream
      Test.Service.BiDiStream.Unbatched:   SKIP
      Batched:                             SKIP
    Test.Service.Errors
      Test.Service.Errors.Timeout:         SKIP
      Test.Service.Errors.Missing:         SKIP
      Test.Service.Errors.Malform:         SKIP

9 out of 39 tests failed (13.95s)
Test suite test: FAIL
Running 1 test suites...
Test suite test: RUNNING...
Tests: grpc-mqtt
  <...>
  Test.Service
    Normal
      LongBytes:                           FAIL
        Exception: MQTTException "disconnected"
      Call:                                SKIP
    Test.Service.ClientStream
      Test.Service.ClientStream.Unbatched: SKIP
      Batched:                             SKIP
    Test.Service.ServerStream
      Test.Service.ServerStream.Unbatched: SKIP
      Test.Service.ServerStream.Batched:   SKIP
    Test.Service.BiDiStream
      Test.Service.BiDiStream.Unbatched:   SKIP
      Batched:                             SKIP
    Test.Service.Errors
      Test.Service.Errors.Timeout:         SKIP
      Test.Service.Errors.Missing:         SKIP
      Test.Service.Errors.Malform:         SKIP

11 out of 39 tests failed (20.43s)
Test suite test: FAIL
Running 1 test suites...
Test suite test: RUNNING...
Tests: grpc-mqtt
  <..>
  Test.Service
    Normal
      LongBytes:                           malloc_consolidate(): unaligned fastbin chunk detected
Test suite test: FAIL

@riz0id is this related (or at least the last failure) to the bug you mentioned the other day that is being masked by having compression turned on?

It looks like the failures started with a9a7c5f.

Running 1 test suites...
Test suite test: RUNNING...
Tests: grpc-mqtt
  <...>
  Test.Service
    Normal
      LongBytes:                           FAIL
        Exception: MQTTException "disconnected"
      Call:                                SKIP
    Test.Service.ClientStream
      Test.Service.ClientStream.Unbatched: SKIP
      Batched:                             SKIP
    Test.Service.ServerStream
      Test.Service.ServerStream.Unbatched: SKIP
      Test.Service.ServerStream.Batched:   SKIP
    Test.Service.BiDiStream
      Test.Service.BiDiStream.Unbatched:   SKIP
      Batched:                             SKIP
    Test.Service.Errors
      Test.Service.Errors.Timeout:         SKIP
      Test.Service.Errors.Missing:         SKIP
      Test.Service.Errors.Malform:         SKIP

11 out of 39 tests failed (20.43s)
Test suite test: FAIL

Currently the "LongBytes" test makes a total of 16 calls each receiving a 8mb response:

        _ <- Async.async do
          withMQTTGRPCClient logger clientConfig \client ->
            Async.replicateConcurrently 8 do
              uuid <- UUID.nextRandom

              let msg = Message.OneInt 8
              let rqt = GRPC.MQTT.MQTTNormalRequest msg 8 (GRPC.Client.MetadataMap (Map.fromList [("rqt-uuid", [UUID.toASCIIBytes uuid])]))

              testServicecallLongBytes (testServiceMqttClient client baseTopic) rqt

        withMQTTGRPCClient logger clientConfig \client ->
          Async.replicateConcurrently 8 do

            -- For uniquely identifying requests to the server.
            uuid <- UUID.nextRandom

            -- NB: 2022-08-02 we discovered a bug with concurrent client
            -- requests that send responses which, when sent back by the
            -- server trigger a GRPCIOTimeout error in some of the clients.
            let msg = Message.OneInt 8
            let rqt = GRPC.MQTT.MQTTNormalRequest msg 8 (GRPC.Client.MetadataMap (Map.fromList [("rqt-uuid", [UUID.toASCIIBytes uuid])]))

            testServicecallLongBytes (testServiceMqttClient client baseTopic) rqt

MQTTDisconnected is from a timeout result being eaten. The first of these two MQTTGRPCClients (the one that is ignored) timing out which is causing the second call to be disconnected. Anyway, the test needs to be re-written, the ignored result is an artifact from late night debugging. The correct test code would just be:

        withMQTTGRPCClient logger clientConfig \client ->
          Async.replicateConcurrently 8 do

            -- For uniquely identifying requests to the server.
            uuid <- UUID.nextRandom

            -- NB: 2022-08-02 we discovered a bug with concurrent client
            -- requests that send responses which, when sent back by the
            -- server trigger a GRPCIOTimeout error in some of the clients.
            let msg = Message.OneInt 8
            let rqt = GRPC.MQTT.MQTTNormalRequest msg 8 (GRPC.Client.MetadataMap (Map.fromList [("rqt-uuid", [UUID.toASCIIBytes uuid])]))

            testServicecallLongBytes (testServiceMqttClient client baseTopic) rqt

I can fix this, I'm still looking at the other issues.

@Friede80 @ixmatus I'm unable to replicate the two remaining issues after fixing the test in #41. I think its very likely that in the case the second "LongBytes" call finished before the first call (made with Async.async) was left running and resulted in some undefined weirdness. If this is true, it would explain why test results are non-deterministic, since the specific way in which the stray async caused issues in the test would be a race condition between the first and second "LongBytes" client calls.

Thank you for finding this @Friede80 !