BatchEventProcessor can try to dequeue an event with a negative timeout.

Question

jkolenofferup opened this issue 4 years ago · 6 comments

The following code may ask for an item from the event_queue with a negative timeout interval.

class BatchEventProcessor(BaseEventProcessor):
    def _run(self):
        try:
            while True:
                if self._get_time() >= self.flushing_interval_deadline:
                    self._flush_batch()
                    self.flushing_interval_deadline = self._get_time() + \
                        self._get_time(self.flush_interval.total_seconds())
                    self.logger.debug('Flush interval deadline. Flushed batch.')
                try:
                    interval = self.flushing_interval_deadline - self._get_time()
                    item = self.event_queue.get(True, interval)  # interval can be negative

If flushing_interval_deadline falls between the time returned by the first call to _get_time() and the time returned by the second call, the guard does not fire and interval will be negative.
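To illustrate why a negative interval matters, here is a minimal sketch using the standard library's queue.Queue, assuming event_queue behaves like one. The helper try_get is hypothetical and only exists for this illustration. A timeout of zero on an empty queue simply reports that nothing is available, while a negative timeout is rejected with a ValueError.

```python
import queue


def try_get(q, timeout):
    """Attempt a blocking get and report which outcome occurred."""
    try:
        return ("item", q.get(True, timeout))
    except queue.Empty:
        # Nothing arrived before the timeout expired.
        return ("empty", None)
    except ValueError as exc:
        # queue.Queue rejects negative timeouts outright.
        return ("error", str(exc))


q = queue.Queue()
zero_result = try_get(q, 0)
negative_result = try_get(q, -0.05)
print(zero_result)
print(negative_result)
```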

Proposed fix:

class BatchEventProcessor(BaseEventProcessor):
    def _run(self):
        try:
            while True:
                loop_time = self._get_time()  # only call _get_time() once per loop iteration
                if loop_time >= self.flushing_interval_deadline:
                    self._flush_batch()
                    self.flushing_interval_deadline = self._get_time() + \
                        self._get_time(self.flush_interval.total_seconds())
                    self.logger.debug('Flush interval deadline. Flushed batch.')
                try:
                    interval = self.flushing_interval_deadline - loop_time
                    item = self.event_queue.get(True, interval)
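The idea behind the proposed fix can be sketched outside the SDK. This is a standalone illustration, not the library's code: ScriptedClock, next_interval, and flush_seconds are hypothetical names, and the clock is assumed never to go backwards. Because one reading feeds both the guard and the subtraction, the interval stays positive in both branches.

```python
class ScriptedClock:
    """Hypothetical stand-in for _get_time(): returns scripted readings in order."""

    def __init__(self, readings):
        self._readings = iter(readings)

    def __call__(self):
        return next(self._readings)


def next_interval(get_time, deadline, flush_seconds):
    """Return (interval, deadline) using one clock read for guard and interval."""
    loop_time = get_time()
    if loop_time >= deadline:
        # A flush would happen here; the new deadline comes from a fresh read.
        deadline = get_time() + flush_seconds
    return deadline - loop_time, deadline


T, epsilon, flush_seconds = 100.0, 0.05, 10.0

# Case 1: the deadline has not passed yet, so only one reading is consumed.
before_interval, _ = next_interval(ScriptedClock([T - epsilon]), T, flush_seconds)

# Case 2: the deadline has passed; the deadline is reset from a later reading.
after_interval, new_deadline = next_interval(
    ScriptedClock([T + epsilon, T + 2 * epsilon]), T, flush_seconds
)
print(before_interval, after_interval, new_deadline)
```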

Answer 1 · 2021-07-29T03:40:17.000Z

Thx @jkolenofferup. Can you provide a case where you were able to produce a negative interval?

Answer 2 · 2021-07-29T17:35:48.000Z

It's sporadic. We have four jobs that each send about 500k events in a batch setting (batch size 100, flush time 10s). Once we dealt with the event_queue dropping problem, we get about 30 to 40 error messages of the form "{event_processor.py:212} ERROR - Uncaught exception processing buffer. Error: 'timeout' must be a non-negative number@-@".

If you are trying to replicate this in a unit test:

Set flushing_interval_deadline to T
Let the first call to _get_time() return T - epsilon
Let the second call to _get_time() return T + epsilon

In any case, reading the current time twice in a situation like this, and expecting both calls to _get_time() to return the same value, should be fixed.
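The replication steps above can be sketched as a standalone script. It does not import the SDK: ScriptedClock and two_read_interval are hypothetical names that only mirror the two-read pattern from the issue, and T and epsilon are arbitrary values. The first reading lands just before the deadline, so the guard stays quiet, and the second lands just after it, so the subtraction goes negative.

```python
class ScriptedClock:
    """Hypothetical stand-in for _get_time(): returns scripted readings in order."""

    def __init__(self, readings):
        self._readings = iter(readings)

    def __call__(self):
        return next(self._readings)


def two_read_interval(get_time, deadline):
    """Mirror the original pattern: one read for the guard, another for the interval."""
    guard_fired = get_time() >= deadline
    # If the guard fired, the real loop would flush and reset the deadline here.
    interval = deadline - get_time()
    return guard_fired, interval


T = 100.0
epsilon = 0.05

# First call returns T - epsilon, second call returns T + epsilon.
clock = ScriptedClock([T - epsilon, T + epsilon])
guard_fired, interval = two_read_interval(clock, T)
print(guard_fired, interval)
```

Passing this interval to a blocking get on the event queue is what would produce the "'timeout' must be a non-negative number" error quoted above.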

Answer 3 · 2021-08-02T20:49:29.000Z

@jkolenofferup I'm curious, do you still get the 30-40 error messages when you change two _get_time() instances into one (loop_time)?
Does your suggestion fix it?
I wasn't able to get any sporadic error messages in my tests. But I was able to trigger the error message based on your suggestion (which is different than generating 500k events with 4 jobs).
I'm not sure if this setup is realistic. Looks like it just triggers the error.

Set flushing_interval_deadline to T
Let the first call to _get_time() return T - epsilon
Let the second call to _get_time() return T + epsilon

There must be a negative time interval at some point to trigger the errors, but a unit test that forces the interval negative and asserts the error message may not reflect exactly what happens in your case.
I'm just trying to write unit tests that capture why the interval becomes negative, which is not easy with a sporadic occurrence.

Answer 4 · 2021-08-02T21:44:15.000Z

We were able to eliminate the error message by making the flush_interval very large. We're in a batch context, so periodic flushing didn't make sense anyways. The two calls to _get_time() easily break if there is a context switch between the first and second call. In that situation there can easily be 100ms or more difference between the two. In this instance, the first call is acting as a guard to catch possible errors caused by the second call.


Answer 5 · 2021-08-03T00:21:16.000Z

What do you mean by "context switch" here:
"The two calls to _get_time() easily break if there is a context switch
between the first and second call." ?

Answer 6 · 2021-08-03T15:30:46.000Z

A context switch as in stopping execution of the current thread/process to run another thread/process.
