Can hang on shutdown without explaining what it's waiting for
Opened this issue · 6 comments
Pixy hangs on shutdown after being used. In our case it's a single client connected to a single topic with a single partition, which produced and consumed a few messages.
Two example logs. First was SIGINT the second was SIGTERM, in case it matters.
^C2020-03-27 10:52:08.675328 Z info </service.0> "Shutting down"
2020-03-27 10:52:08.675378 Z info </service.0/default_pxy_stop.0> Started
2020-03-27 10:52:08.675427 Z info </default.0/adm_stop.0> Started
2020-03-27 10:52:08.675443 Z info </default.0/adm_stop.0> Stopped
2020-03-27 10:52:08.675446 Z info </default.0/prod_stop.0> Started
2020-03-27 10:52:08.675451 Z info </default.0/cons_stop.0> Started
2020-03-27 10:52:08.675523 Z info </default.0/cons.0/disp.0> "Shutting down"
2020-03-27 10:52:08.675491 Z info </default.0/prod_disp.0> "About to stop producer: pendingMsgCount=0"
2020-03-27 10:52:08.675570 Z info </default.0/prod_disp.0> "Stopping producer: pendingMsgCount=0"
2020-03-27 10:52:08.675542 Z info </default.0/cons.0/ussjc-bx-001.ts.example.com.0/disp.0> "Shutting down" kafka.group=ussjc-bx-001.ts.example.com
2020-03-27 10:52:08.675588 Z info "Producer shutting down." category=sarama
2020-03-27 10:52:08.675625 Z info </default.0/prod_merg.0> Stopped
2020-03-27 10:52:08.675639 Z info </default.0/prod_disp.0> Stopped
2020-03-27 10:52:08.675650 Z info "producer/broker/1001 input chan closed" category=sarama
2020-03-27 10:52:08.675665 Z info "producer/broker/1001 shut down" category=sarama
2020-03-27 10:52:08.675665 Z info </default.0/prod_stop.0> Stopped
2020-03-27 10:54:33.659139 Z info </service.0> "Shutting down"
2020-03-27 10:54:33.659175 Z info </service.0/default_pxy_stop.0> Started
2020-03-27 10:54:33.659194 Z info </default.0/prod_stop.0> Started
2020-03-27 10:54:33.659210 Z info </default.0/prod_disp.0> "About to stop producer: pendingMsgCount=0"
2020-03-27 10:54:33.659217 Z info </default.0/prod_disp.0> "Stopping producer: pendingMsgCount=0"
2020-03-27 10:54:33.659222 Z info </default.0/adm_stop.0> Started
2020-03-27 10:54:33.659226 Z info "Producer shutting down." category=sarama
2020-03-27 10:54:33.659240 Z info </default.0/adm_stop.0> Stopped
2020-03-27 10:54:33.659246 Z info </default.0/prod_merg.0> Stopped
2020-03-27 10:54:33.659251 Z info </default.0/prod_disp.0> Stopped
2020-03-27 10:54:33.659259 Z info "producer/broker/1001 input chan closed" category=sarama
2020-03-27 10:54:33.659269 Z info "producer/broker/1001 shut down" category=sarama
2020-03-27 10:54:33.659225 Z info </default.0/cons_stop.0> Started
2020-03-27 10:54:33.659278 Z info </default.0/cons.0/disp.0> "Shutting down"
2020-03-27 10:54:33.659285 Z info </default.0/cons.0/ussjc-bx-001.ts.example.com.0/disp.0> "Shutting down" kafka.group=ussjc-bx-001.ts.example.com
2020-03-27 10:54:33.659254 Z info </default.0/prod_stop.0> Stopped
What version are you using?
Have you acknowledged all consumed messages before stopping? Alternatively you could have used auto_ack=true when consuming. Before stopping kafka-pixy
waits for acknowledgments of consumed messages for proxies.<proxy-name>.consumer.ack_timeout which is 5 minutes by default.
What version are you using?
v0.17.0
Before stopping kafka-pixy waits for acknowledgments of consumed messages for proxies..consumer.ack_timeout which is 5 minutes by default.
That seems like a long time for a process to wait after a SIGTERM (especially without logging an explanatory message).
I'd expect kafka-pixy to stop promptly, dropping the connection, and for the client's Ack to fail with a gRPC error so the client would know the Ack didn't succeed.
What does "before retrying" mean in this context:
# Period of time that Kafka-Pixy should wait for an acknowledgement
# before retrying.
ack_timeout: 5m
That seems like a long time for a process to wait after a SIGTERM (especially without logging an explanatory message).
If kafka-pixy
stops immediately that would result in unacknowledged messages to be offered to another client later, possibly by another kafka-pixy
instance, hence it is advisable to set this parameter to be greater than the longest it takes to handle your messages.
What does "before retrying" mean in this context:
It means that if a message is not acknowledged in ack_timeout
time, it will be offered by kafka-pixy
again.
Thanks for the explanation. I agree it's working as designed.
My suggested changes are:
- clarify the wording of the
ack_timeout
comments by explaining the effect, both on messages and on shutdown. - if a shutdown is blocked waiting on
ack_timeout
log a message to that effect. - add an config option to shutdown more aggressively and document the consequences.
I agree with the first two suggestions, however we will not implement the third, because a user have an option to auto ack consumed messages.