mailgun/kafka-pixy

Can hang on shutdown without explaining what it's waiting for

Opened this issue · 6 comments

Pixy hangs on shutdown after being used. In our case it's a single client connected to a single topic with a single partition, which produced and consumed a few messages.

Two example logs. First was SIGINT the second was SIGTERM, in case it matters.

^C2020-03-27 10:52:08.675328 Z info </service.0> "Shutting down"
2020-03-27 10:52:08.675378 Z info </service.0/default_pxy_stop.0> Started
2020-03-27 10:52:08.675427 Z info </default.0/adm_stop.0> Started
2020-03-27 10:52:08.675443 Z info </default.0/adm_stop.0> Stopped
2020-03-27 10:52:08.675446 Z info </default.0/prod_stop.0> Started
2020-03-27 10:52:08.675451 Z info </default.0/cons_stop.0> Started
2020-03-27 10:52:08.675523 Z info </default.0/cons.0/disp.0> "Shutting down"
2020-03-27 10:52:08.675491 Z info </default.0/prod_disp.0> "About to stop producer: pendingMsgCount=0"
2020-03-27 10:52:08.675570 Z info </default.0/prod_disp.0> "Stopping producer: pendingMsgCount=0"
2020-03-27 10:52:08.675542 Z info </default.0/cons.0/ussjc-bx-001.ts.example.com.0/disp.0> "Shutting down" kafka.group=ussjc-bx-001.ts.example.com
2020-03-27 10:52:08.675588 Z info "Producer shutting down." category=sarama
2020-03-27 10:52:08.675625 Z info </default.0/prod_merg.0> Stopped
2020-03-27 10:52:08.675639 Z info </default.0/prod_disp.0> Stopped
2020-03-27 10:52:08.675650 Z info "producer/broker/1001 input chan closed" category=sarama
2020-03-27 10:52:08.675665 Z info "producer/broker/1001 shut down" category=sarama
2020-03-27 10:52:08.675665 Z info </default.0/prod_stop.0> Stopped
2020-03-27 10:54:33.659139 Z info </service.0> "Shutting down"
2020-03-27 10:54:33.659175 Z info </service.0/default_pxy_stop.0> Started
2020-03-27 10:54:33.659194 Z info </default.0/prod_stop.0> Started
2020-03-27 10:54:33.659210 Z info </default.0/prod_disp.0> "About to stop producer: pendingMsgCount=0"
2020-03-27 10:54:33.659217 Z info </default.0/prod_disp.0> "Stopping producer: pendingMsgCount=0"
2020-03-27 10:54:33.659222 Z info </default.0/adm_stop.0> Started
2020-03-27 10:54:33.659226 Z info "Producer shutting down." category=sarama
2020-03-27 10:54:33.659240 Z info </default.0/adm_stop.0> Stopped
2020-03-27 10:54:33.659246 Z info </default.0/prod_merg.0> Stopped
2020-03-27 10:54:33.659251 Z info </default.0/prod_disp.0> Stopped
2020-03-27 10:54:33.659259 Z info "producer/broker/1001 input chan closed" category=sarama
2020-03-27 10:54:33.659269 Z info "producer/broker/1001 shut down" category=sarama
2020-03-27 10:54:33.659225 Z info </default.0/cons_stop.0> Started
2020-03-27 10:54:33.659278 Z info </default.0/cons.0/disp.0> "Shutting down"
2020-03-27 10:54:33.659285 Z info </default.0/cons.0/ussjc-bx-001.ts.example.com.0/disp.0> "Shutting down" kafka.group=ussjc-bx-001.ts.example.com
2020-03-27 10:54:33.659254 Z info </default.0/prod_stop.0> Stopped

What version are you using?

Have you acknowledged all consumed messages before stopping? Alternatively you could have used auto_ack=true when consuming. Before stopping kafka-pixy waits for acknowledgments of consumed messages for proxies.<proxy-name>.consumer.ack_timeout which is 5 minutes by default.

What version are you using?

v0.17.0

Before stopping kafka-pixy waits for acknowledgments of consumed messages for proxies..consumer.ack_timeout which is 5 minutes by default.

That seems like a long time for a process to wait after a SIGTERM (especially without logging an explanatory message).

I'd expect kafka-pixy to stop promptly, dropping the connection, and for the client's Ack to fail with a gRPC error so the client would know the Ack didn't succeed.

What does "before retrying" mean in this context:

      # Period of time that Kafka-Pixy should wait for an acknowledgement
      # before retrying.
      ack_timeout: 5m

That seems like a long time for a process to wait after a SIGTERM (especially without logging an explanatory message).

If kafka-pixy stops immediately that would result in unacknowledged messages to be offered to another client later, possibly by another kafka-pixy instance, hence it is advisable to set this parameter to be greater than the longest it takes to handle your messages.

What does "before retrying" mean in this context:

It means that if a message is not acknowledged in ack_timeout time, it will be offered by kafka-pixy again.

Thanks for the explanation. I agree it's working as designed.

My suggested changes are:

  • clarify the wording of the ack_timeout comments by explaining the effect, both on messages and on shutdown.
  • if a shutdown is blocked waiting on ack_timeout log a message to that effect.
  • add an config option to shutdown more aggressively and document the consequences.

I agree with the first two suggestions, however we will not implement the third, because a user have an option to auto ack consumed messages.