project8/dragonfly

ROACH trigger setting logging flexibility

Opened this issue · 2 comments

We log the time_window_settings and trigger_settings methods of the roach_daq_run_interface. Originally we had separate psyllid and dragonfly daq channel configs for these two settings, and would toggle both in step. Since psyllid was moved to kubernetes, the dragonfly daq keeps trying to log a now-unavailable psyllid method and throws long errors.

Should the dragonfly daq provider and psyllid run together in the same pod (kubernetes has a model for this), and would that give the right restart behavior?
Can the daq provider method be altered to be more tolerant while not losing sensitivity to failure modes we care about?
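On the second question, one option is a wrapper around the scheduled logging call that downgrades an "endpoint unavailable" failure to a single warning while still raising for anything else. This is only a sketch: `tolerant_log`, `EndpointUnavailable`, and the `critical` flag are hypothetical names, not the actual dragonfly API.

```python
# Hypothetical sketch of a more tolerant scheduled-logging call.
# EndpointUnavailable stands in for whatever error dragonfly raises
# when the psyllid endpoint cannot be reached.
import logging

logger = logging.getLogger(__name__)


class EndpointUnavailable(Exception):
    """Stand-in for the 'psyllid method unavailable' error."""


def tolerant_log(getter, name, critical=False):
    """Call getter() and log the result.

    If the endpoint is unavailable and the setting is not critical,
    emit one warning and return None instead of a long traceback.
    Any other exception still propagates, so real failure modes
    are not silenced.
    """
    try:
        value = getter()
    except EndpointUnavailable as err:
        if critical:
            raise
        logger.warning("skipping %s (endpoint unavailable): %s", name, err)
        return None
    logger.info("%s = %r", name, value)
    return value
```

The point of the `critical` flag is that we keep sensitivity to failure modes we care about: settings marked critical still fail loudly, while the routine psyllid reads just skip a cycle.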

Currently it requires operator intervention when changing to the streaming config:
dragonfly set r2_channel_{X}_time_window.schedule_status off -b myrna.p8
dragonfly set r2_channel_{X}_trigger_settings.schedule_status off -b myrna.p8

We don't do streaming very often, so this may be a rare edge case.

I don't think I've fully understood this issue yet, but:

  • If daq provider and psyllid should have coupled lifecycles then it makes sense to put them into the same pod.
  • If one of them crashes, the default behavior in k8s is to restart only that container, not every container in the pod.
  • If the pod specification is changed/updated, the new helm release will replace the entire pod, which means restarting both/all containers.
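The coupled-lifecycle option would look roughly like the pod spec below. This is only a sketch: the names and image tags are placeholders, and in practice this would live in a Deployment template inside the helm chart rather than a bare Pod.

```yaml
# Sketch: psyllid and the dragonfly daq provider as two containers
# in one pod. Names and images are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: roach-daq
spec:
  restartPolicy: Always   # a crashed container is restarted on its own;
                          # the other container keeps running
  containers:
    - name: psyllid
      image: project8/psyllid:latest      # placeholder tag
    - name: dragonfly-daq
      image: project8/dragonfly:latest    # placeholder tag
```

Containers in the same pod share a network namespace, so the daq provider could reach psyllid on localhost, and a helm upgrade that touches the pod template replaces both containers together.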

... I'm not sure whether this fully addresses the questions above.

Does a liveness probe trigger a pod-level or a container-level restart?
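For reference, liveness probes are declared per container, and a failing probe causes the kubelet to restart only that container (subject to the pod's restartPolicy), not the whole pod. A minimal sketch, with a placeholder health-check command:

```yaml
# Sketch: container-level liveness probe. The exec command is a
# placeholder for a real health check.
containers:
  - name: dragonfly-daq
    image: project8/dragonfly:latest   # placeholder tag
    livenessProbe:
      exec:
        command: ["true"]              # placeholder health check
      periodSeconds: 30
      failureThreshold: 3
```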