crashed psyllid blocking other channels

Question

Closed this issue 5 years ago · 3 comments

A single crashed (completely unresponsive) psyllid channel can block data taking on all other channels. This needs to be fixed with the highest priority.

Here's the failure mode:

In DAQProvider.start_timed_run, there is a call to the DAQ-specific _do_checks.
One of those checks is ROACH1ChAcquisitionInterface._check_psyllid_instance.
This method calls PsyllidProvider.get_active_channels, which checks the status of all psyllid channels and throws an error if the desired one isn't active.

https://github.com/project8/dragonfly/blob/develop/dragonfly/implementations/roach_daq_run_interface.py#L110
If a single channel has crashed, that call will time out, start_run crashes, and we can't take data.

This method is doubly confusing, because all get_active_channels does is iterate through all channels, calling request_status on each (which _check_psyllid_instance has already done for the desired channel) to update the dict, and then return the dict.

What is this last check even accomplishing?

I think removing that will make this less stateful, and more resilient.
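
To make the failure mode concrete, here is a minimal toy sketch. It is not the dragonfly code: FakePsyllidProvider, ChannelTimeout, and the two start_run_* functions are hypothetical stand-ins, and only the method names request_status and get_active_channels are borrowed from the description above. It assumes an unresponsive channel shows up as a timeout on any request sent to it.

```python
class ChannelTimeout(Exception):
    """Hypothetical stand-in for a request that never gets a reply."""


class FakePsyllidProvider:
    """Toy model (not the dragonfly class) of a provider serving several channels."""

    def __init__(self, responsive):
        # responsive maps channel name -> True if that instance answers requests
        self.responsive = dict(responsive)
        self.status = {}

    def request_status(self, channel):
        # An unresponsive instance never replies; modeled here as a timeout error
        if not self.responsive[channel]:
            raise ChannelTimeout(f"no reply from channel {channel}")
        self.status[channel] = "active"
        return self.status[channel]

    def get_active_channels(self):
        # Polls every channel, so a single dead channel makes the whole call fail
        for channel in self.responsive:
            self.request_status(channel)
        return [ch for ch, st in self.status.items() if st == "active"]


def start_run_checking_all(provider, channel):
    """Pre-run check as described above: the wanted channel depends on all others."""
    provider.request_status(channel)
    if channel not in provider.get_active_channels():
        raise RuntimeError(f"channel {channel} is not active")
    return f"run started on {channel}"


def start_run_checking_one(provider, channel):
    """Same check with the final all-channel poll removed."""
    provider.request_status(channel)
    return f"run started on {channel}"
```

With channels a and c healthy and b unresponsive, start_run_checking_all(provider, "a") raises ChannelTimeout even though a is fine, while start_run_checking_one(provider, "a") succeeds.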

Answer 1 · 2019-06-06T07:36:57.000Z

I thought that when a channel is crashed, request-status would result in a "message not deliverable" error? Is that no longer the case? Or are we talking about the case where the instance is not crashed but unresponsive?

Anyway, the purpose of this call is not to check that the psyllid instance is active. That is accomplished by the status request a few lines above. Instead, this check is supposed to make sure that the psyllid instance, the psyllid interface, and the channel daq interface share the same channel id definitions. I think you're right and it can be removed. We only set channel ids and stream labels via the config files and then never touch them during operation.

I will make a branch and remove this check.

Answer 2 · 2019-06-06T18:58:59.000Z

There's probably some subtlety to the crash state where the queue remains valid but completely unresponsive. So instead of immediately returning undeliverable, it waits for a timeout.
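
A rough illustration of that distinction, as a toy model only: ToyBroker and Undeliverable are hypothetical names and this is not the real messaging layer. The assumption is that a request to a target with no queue fails immediately, while a request to a queue that still exists but has no consumer sits unanswered until the caller's timeout expires.

```python
import queue


class Undeliverable(Exception):
    """Hypothetical error raised when no queue exists for the target."""


class ToyBroker:
    """Toy message broker used only to contrast the two failure modes."""

    def __init__(self):
        self.queues = {}  # target name -> request queue

    def declare(self, target):
        # The queue exists from now on, whether or not anything consumes from it
        self.queues[target] = queue.Queue()

    def request(self, target, timeout):
        if target not in self.queues:
            # No queue at all: the sender learns about the failure right away
            raise Undeliverable(target)
        reply_box = queue.Queue()
        self.queues[target].put(("status", reply_box))
        # Queue exists but nothing consumes it: block until the timeout expires,
        # at which point queue.Empty is raised
        return reply_box.get(timeout=timeout)
```

In this model, a request to an undeclared target raises Undeliverable at once, whereas a request to a declared but unconsumed target raises queue.Empty only after the full timeout has elapsed.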

I agree that, since we never mess with the mapping, the PR looks right.

Answer 3 · 2019-06-10T05:29:32.000Z

Fixed in #195, released in v1.17.2.