ssl-hep/ServiceX

Celery-enabled transformers lose files due to queue auto-deletion

ponyisi opened this issue

Describe the bug
When run at large scale, transformation requests are lost because RabbitMQ queues are auto-deleted before all files have been processed. This appears to happen because the auto-delete behavior depends on whether any consumers are connected, not on whether the queue is empty. If the workers disconnect while they are working on a long-running task, RabbitMQ can decide that nobody is interested in the queue any more and delete it, along with any files still waiting in it.
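To make the suspected failure mode concrete, here is a toy model in plain Python. It is not RabbitMQ, Celery, or ServiceX code: the class `ToyAutoDeleteQueue`, the queue name, and the worker name are all made up for illustration, and the model ignores details such as redelivery of unacknowledged messages. It only encodes the rule described above, that the queue goes away when its last consumer leaves, regardless of how many messages are still pending.

```python
class ToyAutoDeleteQueue:
    """Toy model of an auto-delete queue (hypothetical, not broker code)."""

    def __init__(self, name):
        self.name = name
        self.messages = []      # files still waiting to be transformed
        self.consumers = set()  # currently connected workers
        self.had_consumer = False
        self.deleted = False
        self.lost = []          # messages dropped when the queue was deleted

    def publish(self, message):
        if self.deleted:
            raise RuntimeError(f"queue {self.name!r} no longer exists")
        self.messages.append(message)

    def subscribe(self, consumer):
        if self.deleted:
            raise RuntimeError(f"queue {self.name!r} no longer exists")
        self.consumers.add(consumer)
        self.had_consumer = True

    def unsubscribe(self, consumer):
        self.consumers.discard(consumer)
        # The auto-delete rule modelled here: deletion is triggered by the
        # last consumer leaving, not by the queue becoming empty.
        if self.had_consumer and not self.consumers:
            self.lost = list(self.messages)
            self.messages.clear()
            self.deleted = True


# More files than workers: five files, one worker.
queue = ToyAutoDeleteQueue("transformer-request")
for i in range(5):
    queue.publish(f"file-{i}")

queue.subscribe("worker-1")
in_progress = queue.messages.pop(0)  # the worker picks up one file

# The worker's connection drops while it is busy with a long task.
queue.unsubscribe("worker-1")

print(f"deleted={queue.deleted}, files lost={len(queue.lost)}")
```

In this sketch the four files that were still queued are dropped as soon as the only worker disconnects, which matches the symptom of jobs hanging part way through.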

To Reproduce
Steps to reproduce the behavior:

  1. Run the ServiceX client 3.0 version of the CMS AGC grand challenge (iris-hep/analysis-grand-challenge#225) with no restriction on the number of files, so that the number of files to transform exceeds the number of ServiceX transformation workers. The jobs will hang part way through.

The disappearance of the relevant queues while the jobs are still ongoing was verified using rabbitmqctl list_queues in the RabbitMQ pod. When the auto_delete option is removed, the transformations complete fully, as expected.

Expected behavior
Transformations complete (equivalently, the transformer queues are only deleted at the end of the transformation).