project8/dragonfly

crashed psyllid blocking other channels

Closed this issue · 3 comments

A single crashed (completely unresponsive) psyllid channel can block data taking on all other channels. This blocking behavior needs to be removed with highest priority.

Here's the failure mode:

  1. In DAQProvider.start_timed_run, there is a call to the DAQ-specific _do_checks
  2. One of those checks is ROACH1ChAcquisitionInterface._check_psyllid_instance
  3. This method calls PsyllidProvider.get_active_channels, which checks the status of all psyllid channels and throws an error if the desired one isn't active.
     1. This method is doubly confusing, because all get_active_channels does is iterate through all channels calling request_status (which was already done in _check_psyllid_instance) to update the dict, and then return the dict.
        • What is this last check even accomplishing?

I think removing that check will make this less stateful and more resilient.
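To make the failure mode concrete, here is a minimal sketch. It is not the dragonfly code: `ChannelTimeout`, the stand-in `request_status`, and the `responsive` dict are hypothetical names used only for illustration. It contrasts a check that sweeps every channel with one that only touches the channel about to take data.

```python
class ChannelTimeout(Exception):
    """Hypothetical error standing in for a status request that never gets a reply."""


def request_status(channel, responsive):
    # Stand-in for a per-channel status request (assumption, not the real API).
    if not responsive[channel]:
        raise ChannelTimeout(f"no reply from channel {channel}")
    return "active"


def check_all_channels(desired, responsive):
    # Pattern described above: query every channel, then look up the desired one.
    # Any unresponsive channel aborts the sweep, even if `desired` is healthy.
    status = {ch: request_status(ch, responsive) for ch in responsive}
    return status[desired] == "active"


def check_one_channel(desired, responsive):
    # Alternative: only query the channel that is about to take data.
    return request_status(desired, responsive) == "active"


# Channel "b" is unresponsive; "a" and "c" are healthy.
responsive = {"a": True, "b": False, "c": True}
```

In this toy setup, `check_one_channel("a", responsive)` succeeds, while `check_all_channels("a", responsive)` raises because of channel "b", which is the cross-channel blocking described in this issue.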

I thought that when a channel has crashed, request-status would result in a "message not deliverable" error. Is that no longer the case? Or are we talking about the case where the instance has not crashed but is unresponsive?

Anyway, the purpose of this call is not to check that the psyllid instance is active; that is accomplished by the status request a few lines above. Instead, this check is supposed to make sure that the psyllid instance, the psyllid interface, and the channel DAQ interface share the same channel id definitions. I think you're right and it can be removed. We only set channel ids and stream labels via the config files and then never touch them during operation.

I will make a branch and remove this check.

There's probably some subtlety to the crash state where the queue remains valid but completely unresponsive. So instead of immediately returning undeliverable, it waits for a timeout.
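A rough sketch of that distinction, using Python's standard `queue` module rather than the real messaging layer. The `Undeliverable` exception, the `send_request` helper, and the `queues` dict are hypothetical and only model the two outcomes: a missing queue fails immediately, while a queue that still exists but has no live consumer makes the caller wait for the full timeout.

```python
import queue


class Undeliverable(Exception):
    """Hypothetical error for a request whose target queue does not exist."""


def send_request(queues, name, timeout=0.05):
    # Case 1: the queue is gone, so the request fails right away.
    if name not in queues:
        raise Undeliverable(name)
    # Case 2: the queue exists, so the request is accepted...
    request_q, reply_q = queues[name]
    request_q.put("status")
    try:
        # ...but with nothing consuming it, the caller blocks until the timeout.
        return reply_q.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"{name} did not reply within {timeout}s")


# "ch_a" models a crashed instance whose queue is still valid but never answered.
queues = {"ch_a": (queue.Queue(), queue.Queue())}
```

Here `send_request(queues, "ch_b")` raises `Undeliverable` immediately, whereas `send_request(queues, "ch_a")` only fails after the timeout elapses.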

I agree that, since we never mess with the mapping, the PR looks right.

fixed in #195, released in v1.17.2