Tests: flaky failures

Question

Tests: flaky failures

Closed this issue 9 days ago · 4 comments

Describe the current state

During the development of #447 I noticed some flaky tests, passing locally most of the times but not in CI:

consumer_test.go:224 -> ProducedWithHeaders

    metrics_test.go:64: [Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
        missing expected values:

consumer_test.go:464 --- FAIL: TestConsumerGracefulShutdown (0.03s)

    --- FAIL: TestConsumerGracefulShutdown/AtLeastOnceDelivery (0.01s)
    --- PASS: TestConsumerGracefulShutdown/AtMostOnceDelivery (0.02s)
    
            	Error Trace:	/home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:464
        	            				/home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:469
        	Error:      	Not equal: 
        	            	expected: 2
        	            	actual  : 1
        	Test:       	TestConsumerGracefulShutdown/AtLeastOnceDelivery

TestProducerMetrics

Describe the desired state

The tests are consistently reporting either a failure or a success.

Answer 1 · 2024-05-13T15:24:56.000Z

For TestProducerMetrics: running it with

 go test -v -count=10 -run="TestProducerMetrics/DeadlineExceeded" ./kafka/...

most certainly produces a panic

=== RUN   TestProducerMetrics/DeadlineExceeded
    metrics_test.go:65: 
                Error Trace:    /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:65
                                                        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123
                Error:          Not equal: 
                                expected: 2
                                actual  : 1
                Test:           TestProducerMetrics/DeadlineExceeded
    metrics_test.go:67: producer.messages.count
    metrics_test.go:71: [Name not equal:
        expected: messaging.kafka.connect_errors.count
        actual: producer.messages.count Description not equal:
        expected: Total number of connection errors, by broker
        actual: The number of messages produced Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
        missing expected values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[1]attribute.KeyValue{attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}}}}, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Time:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Value:1, Exemplars:[]metricdata.Exemplar[int64](nil)}
        unexpected additional values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[7]attribute.KeyValue{attribute.KeyValue{Key:"error_reason", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"timeout", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.destination.name", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"default-topic", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.kafka.destination.partition", Value:attribute.Value{vtype:2, numeric:0x0, stringly:"", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}, attribute.KeyValue{Key:"namespace", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space", slice:interface {}(nil)}}, attribute.KeyValue{Key:"outcome", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"failure", slice:interface {}(nil)}}, attribute.KeyValue{Key:"topic", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space-default-topic", slice:interface {}(nil)}}}}}, StartTime:time.Date(2024, time.May, 13, 17, 24, 7, 997820327, time.Local), Time:time.Date(2024, time.May, 13, 17, 24, 7, 998078794, time.Local), Value:3, Exemplars:[]metricdata.Exemplar[int64](nil)}
        ]
--- FAIL: TestProducerMetrics (0.00s)
    --- FAIL: TestProducerMetrics/DeadlineExceeded (0.00s)
panic: runtime error: index out of range [1] with length 1 [recovered]
        panic: runtime error: index out of range [1] with length 1

goroutine 7 [running]:
testing.tRunner.func1.2({0xdab780, 0xc000338078})
        /usr/local/go/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1548 +0x397
panic({0xdab780?, 0xc000338078?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func1({0xf62858, 0xc00024a5b0}, 0xc0001ff6c0, {0xf604c8, 0xc000180c60}, {0xf62c80, 0xc0000bad20}, {0xc000055a80, 0x2, 0x2}, ...)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:70 +0x6be
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func2(0xc0001ff6c0?)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123 +0xd68
testing.tRunner(0xc0001ff6c0, 0xc000053590)
        /usr/local/go/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 6
        /usr/local/go/src/testing/testing.go:1648 +0x3ad
FAIL    github.com/elastic/apm-queue/v2/kafka   0.008s
FAIL

From what I could see, it appears that the commit 49cfc50 introduced even more flakyness.

The offending line i if !slices.Contains(ignore, metrics[i].Name) { because when the returned metrics are less than want, we can still get to the loop (require would halt the test, but assert is used so we keep executing).

Answer 2 · 2024-05-29T15:16:59.000Z

One more flaky test I've discovered, TestTopicCreatorCreateTopics fails occasionally, with logs of data race detected.
example: https://github.com/elastic/apm-queue/actions/runs/9282217459/job/25539755922#step:4:1059

Answer 3 · 2024-08-09T08:04:31.000Z

@rubvs can you confirm that this can be closed now that #537 got merged?

Answer 4 · 2024-08-09T12:33:10.000Z

TestProducerMetrics are still failing with:

go test -v -count=10 -failfast ./kafka/...

However, TestConsumerGracefulShutdown seems to be fixed.