
Test ordering causes failures in very specific circumstances.


The bug

In v2.7, exactly-once processing was introduced, and some tests were parameterized to confirm consistent behavior under both at-least-once (ALOS) and exactly-once (EOS) guarantees.

In one of those tests, TestAppGroupBy - test_group_by_with_window, the ordering of the ALOS and EOS parameter runs appears to matter and can cause a test failure, but only under specific circumstances.

The bug occurs specifically when:

  1. All of test_app (or the entire suite) is run
  2. test_group_by_with_window runs ALOS first, then EOS

This will cause the EOS test to fail.

The root cause is unknown; more details are provided below.
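For reference, the parameterization presumably has roughly this shape. This is a minimal sketch, not the actual test code: the broker address and test body are placeholders, and it assumes the ALOS/EOS parameter toggles the Application's processing_guarantee option.

```python
import pytest

from quixstreams import Application


@pytest.mark.parametrize(
    "processing_guarantee",
    # ALOS first, then EOS: the ordering that fails when all of test_app runs
    ["at-least-once", "exactly-once"],
)
def test_group_by_with_window(processing_guarantee):
    # Placeholder body: the real test builds a group-by + window topology,
    # produces input messages, runs the app, and asserts on processed_count.
    app = Application(
        broker_address="localhost:9092",  # placeholder broker address
        processing_guarantee=processing_guarantee,
    )
    ...
```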

Behavior around the bug

As mentioned, this bug appears to be very specific.

  • If you add a sufficient sleep step (2-3s) before running the app, or add extra logging that slows app processing, EOS passes consistently (see the sketch after this list).
  • Switching the parameter order to EOS first, then ALOS, causes both to pass consistently.
  • Running only TestAppGroupBy (rather than all of test_app) causes EOS to pass consistently.
  • If you add multiple parameter runs like [ALOS, EOS, ALOS, EOS, ...], only the first EOS run fails.
  • I have not encountered this behavior on any other test.
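For illustration, the sleep workaround mentioned in the first bullet amounts to something like this (the helper name and run call are placeholders for the real test setup):

```python
import time

produce_test_messages()  # hypothetical helper: seed the input topic
time.sleep(3)            # a 2-3s pause before starting the app masks the failure
app.run(sdf)             # sdf: the group-by + window StreamingDataFrame
```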

What is failing in the test

Test failure

The message processed_count validation fails: actual (1) != expected (2).

Source of failure

Essentially, the app seems unable to successfully subscribe to (the beginning of?) the groupby repartition topic, and thus cannot read the grouped-by message.

This can be surmised fairly easily from the warning in the stack trace:

[WARNING] quixstreams.kafka.consumer: OFFSET [rdkafka#consumer-139] [thrd:main]: repartition__46fd98ea-e020-4177-b444-115ad5311adc--0975fdbb-7076-4e54-a985-671692120a93--user [0]: offset reset (at offset 0 (leader epoch 0), broker 0) to offset BEGINNING (leader epoch -1): fetch failed due to requested offset not available on the broker: Broker: Offset out of range

What's strange is that there are NO messages on this topic to begin with, so there should be no offset at all; I'm not sure how the offset can be "out of range". The first (and only!) message is put on the topic by the app itself during processing.
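One way to double-check that claim would be to query the repartition partition's watermarks directly with a throwaway consumer, along these lines (topic name and config are placeholders); an empty partition should report low == high == 0:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "debug-watermarks",
})
# Ask the broker for the low/high watermarks of the repartition topic's
# only partition (topic name shortened here).
low, high = consumer.get_watermark_offsets(
    TopicPartition("repartition__...--user", 0), timeout=5.0
)
print(f"lowwater={low} highwater={high}")
consumer.close()
```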

There's nothing else useful in the stack trace; the app otherwise appears to run normally and ultimately fails because it hits the 10s runtime timeout we impose on it (which is honestly perhaps the most concerning part of this).

What I've tried

Basically the stuff listed above.

We recently changed our test container image because of another bug where the container consistently failed on one specific test, and we were able to replicate that issue outside of the test (it failed on every Nth run, and ONLY that one...very weird). I found that issue by inspecting a topic and discovering that messages were not being produced to it as expected, which manifested as the low watermark being 2 instead of 0.

I thought this issue might be something similar, so I tried printing out the consumer's assigned topic offsets/watermarks, but that only caused the test to pass; I wonder if the extra consumer querying caused something to "settle" correctly.

I also confirmed the topics and state are unique per run; running several parameterized tests back to back supports this as well, since otherwise we'd probably fail the same tests consistently.

It might be valuable to try to replicate this result outside of testing, like I did with the previous issue (I have not attempted that yet).

Other thoughts

I think this is just a rare bug that we've encountered due to some specific circumstance created by running numerous tests in quick succession. I don't think anything is actually broken in the implementation itself.