Tests: flaky failures
Closed this issue · 4 comments
Describe the current state
During the development of #447 I noticed some flaky tests, passing locally most of the times but not in CI:
- consumer_test.go:224 -> ProducedWithHeaders
metrics_test.go:64: [Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
missing expected values:
- consumer_test.go:464 --- FAIL: TestConsumerGracefulShutdown (0.03s)
--- FAIL: TestConsumerGracefulShutdown/AtLeastOnceDelivery (0.01s)
--- PASS: TestConsumerGracefulShutdown/AtMostOnceDelivery (0.02s)
Error Trace: /home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:464
/home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:469
Error: Not equal:
expected: 2
actual : 1
Test: TestConsumerGracefulShutdown/AtLeastOnceDelivery
- TestProducerMetrics
Describe the desired state
The tests are consistently reporting either a failure or a success.
For TestProducerMetrics
: running it with
go test -v -count=10 -run="TestProducerMetrics/DeadlineExceeded" ./kafka/...
most certainly produces a panic
=== RUN TestProducerMetrics/DeadlineExceeded
metrics_test.go:65:
Error Trace: /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:65
/home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123
Error: Not equal:
expected: 2
actual : 1
Test: TestProducerMetrics/DeadlineExceeded
metrics_test.go:67: producer.messages.count
metrics_test.go:71: [Name not equal:
expected: messaging.kafka.connect_errors.count
actual: producer.messages.count Description not equal:
expected: Total number of connection errors, by broker
actual: The number of messages produced Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
missing expected values:
metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[1]attribute.KeyValue{attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}}}}, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Time:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Value:1, Exemplars:[]metricdata.Exemplar[int64](nil)}
unexpected additional values:
metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[7]attribute.KeyValue{attribute.KeyValue{Key:"error_reason", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"timeout", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.destination.name", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"default-topic", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.kafka.destination.partition", Value:attribute.Value{vtype:2, numeric:0x0, stringly:"", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}, attribute.KeyValue{Key:"namespace", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space", slice:interface {}(nil)}}, attribute.KeyValue{Key:"outcome", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"failure", slice:interface {}(nil)}}, attribute.KeyValue{Key:"topic", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space-default-topic", slice:interface {}(nil)}}}}}, StartTime:time.Date(2024, time.May, 13, 17, 24, 7, 997820327, time.Local), Time:time.Date(2024, time.May, 13, 17, 24, 7, 998078794, time.Local), Value:3, Exemplars:[]metricdata.Exemplar[int64](nil)}
]
--- FAIL: TestProducerMetrics (0.00s)
--- FAIL: TestProducerMetrics/DeadlineExceeded (0.00s)
panic: runtime error: index out of range [1] with length 1 [recovered]
panic: runtime error: index out of range [1] with length 1
goroutine 7 [running]:
testing.tRunner.func1.2({0xdab780, 0xc000338078})
/usr/local/go/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1548 +0x397
panic({0xdab780?, 0xc000338078?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func1({0xf62858, 0xc00024a5b0}, 0xc0001ff6c0, {0xf604c8, 0xc000180c60}, {0xf62c80, 0xc0000bad20}, {0xc000055a80, 0x2, 0x2}, ...)
/home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:70 +0x6be
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func2(0xc0001ff6c0?)
/home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123 +0xd68
testing.tRunner(0xc0001ff6c0, 0xc000053590)
/usr/local/go/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 6
/usr/local/go/src/testing/testing.go:1648 +0x3ad
FAIL github.com/elastic/apm-queue/v2/kafka 0.008s
FAIL
From what I could see, it appears that the commit 49cfc50 introduced even more flakyness.
The offending line i if !slices.Contains(ignore, metrics[i].Name) {
because when the returned metrics
are less than want
, we can still get to the loop (require
would halt the test, but assert
is used so we keep executing).
One more flaky test I've discovered, TestTopicCreatorCreateTopics
fails occasionally, with logs of data race detected.
example: https://github.com/elastic/apm-queue/actions/runs/9282217459/job/25539755922#step:4:1059
TestProducerMetrics
are still failing with:
go test -v -count=10 -failfast ./kafka/...
However, TestConsumerGracefulShutdown
seems to be fixed.