elastic/apm-queue

Tests: flaky failures

Closed this issue · 4 comments

Describe the current state

During the development of #447 I noticed some flaky tests, passing locally most of the times but not in CI:

  • consumer_test.go:224 -> ProducedWithHeaders
    metrics_test.go:64: [Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
        missing expected values:
  • consumer_test.go:464 --- FAIL: TestConsumerGracefulShutdown (0.03s)
    --- FAIL: TestConsumerGracefulShutdown/AtLeastOnceDelivery (0.01s)
    --- PASS: TestConsumerGracefulShutdown/AtMostOnceDelivery (0.02s)
    
            	Error Trace:	/home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:464
        	            				/home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:469
        	Error:      	Not equal: 
        	            	expected: 2
        	            	actual  : 1
        	Test:       	TestConsumerGracefulShutdown/AtLeastOnceDelivery
  • TestProducerMetrics

Describe the desired state

The tests are consistently reporting either a failure or a success.

For TestProducerMetrics: running it with

 go test -v -count=10 -run="TestProducerMetrics/DeadlineExceeded" ./kafka/...

most certainly produces a panic

=== RUN   TestProducerMetrics/DeadlineExceeded
    metrics_test.go:65: 
                Error Trace:    /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:65
                                                        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123
                Error:          Not equal: 
                                expected: 2
                                actual  : 1
                Test:           TestProducerMetrics/DeadlineExceeded
    metrics_test.go:67: producer.messages.count
    metrics_test.go:71: [Name not equal:
        expected: messaging.kafka.connect_errors.count
        actual: producer.messages.count Description not equal:
        expected: Total number of connection errors, by broker
        actual: The number of messages produced Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
        missing expected values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[1]attribute.KeyValue{attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}}}}, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Time:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Value:1, Exemplars:[]metricdata.Exemplar[int64](nil)}
        unexpected additional values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[7]attribute.KeyValue{attribute.KeyValue{Key:"error_reason", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"timeout", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.destination.name", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"default-topic", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.kafka.destination.partition", Value:attribute.Value{vtype:2, numeric:0x0, stringly:"", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}, attribute.KeyValue{Key:"namespace", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space", slice:interface {}(nil)}}, attribute.KeyValue{Key:"outcome", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"failure", slice:interface {}(nil)}}, attribute.KeyValue{Key:"topic", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space-default-topic", slice:interface {}(nil)}}}}}, StartTime:time.Date(2024, time.May, 13, 17, 24, 7, 997820327, time.Local), Time:time.Date(2024, time.May, 13, 17, 24, 7, 998078794, time.Local), Value:3, Exemplars:[]metricdata.Exemplar[int64](nil)}
        ]
--- FAIL: TestProducerMetrics (0.00s)
    --- FAIL: TestProducerMetrics/DeadlineExceeded (0.00s)
panic: runtime error: index out of range [1] with length 1 [recovered]
        panic: runtime error: index out of range [1] with length 1

goroutine 7 [running]:
testing.tRunner.func1.2({0xdab780, 0xc000338078})
        /usr/local/go/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1548 +0x397
panic({0xdab780?, 0xc000338078?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func1({0xf62858, 0xc00024a5b0}, 0xc0001ff6c0, {0xf604c8, 0xc000180c60}, {0xf62c80, 0xc0000bad20}, {0xc000055a80, 0x2, 0x2}, ...)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:70 +0x6be
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func2(0xc0001ff6c0?)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123 +0xd68
testing.tRunner(0xc0001ff6c0, 0xc000053590)
        /usr/local/go/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 6
        /usr/local/go/src/testing/testing.go:1648 +0x3ad
FAIL    github.com/elastic/apm-queue/v2/kafka   0.008s
FAIL

From what I could see, it appears that the commit 49cfc50 introduced even more flakyness.

The offending line i if !slices.Contains(ignore, metrics[i].Name) { because when the returned metrics are less than want, we can still get to the loop (require would halt the test, but assert is used so we keep executing).

One more flaky test I've discovered, TestTopicCreatorCreateTopics fails occasionally, with logs of data race detected.
example: https://github.com/elastic/apm-queue/actions/runs/9282217459/job/25539755922#step:4:1059

@rubvs can you confirm that this can be closed now that #537 got merged?

TestProducerMetrics are still failing with:

go test -v -count=10 -failfast ./kafka/...

However, TestConsumerGracefulShutdown seems to be fixed.