project8/dragonfly

remove roach DAQ threshold cache


The stored_threshold of the roach_daq_run_interface doesn't seem to serve any purpose. If the user wants to know the threshold, psyllid can be queried directly. Psyllid now starts up with the correct trigger settings, so this only catches a mismatch when (a) the roach daq service restarts or (b) the psyllid threshold is directly set without dragonfly.

If (a) is the goal, why not create a variable with a clearer name, such as self.has_reset, which initializes to True and is set to False when the DAQ system is prepared, or something like that. Or make sure that the ROACH DAQ service isn't stateful and doesn't care if it restarts (which is probably already true?).
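
A rough sketch of that idea (the class and method names here are illustrative, not dragonfly's actual code):

```python
# Illustrative sketch of the has_reset idea; names do not match dragonfly's real classes.
class RoachDAQRunInterface(object):
    def __init__(self):
        # True until the DAQ system has been prepared at least once since this service started
        self.has_reset = True

    def prepare_daq_system(self):
        # ... apply trigger settings, set the central frequency, etc. ...
        self.has_reset = False

    def _do_checks(self):
        if self.has_reset:
            raise RuntimeError('DAQ interface restarted and has not been re-prepared')
```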

And (b) seems like an acceptable user choice, to directly set psyllid without using the convenience method.

A self.has_reset would be yet another state variable that doesn't automatically change if psyllid re-activates. The re-activation is checked by comparing the central frequencies of the ROACH and psyllid.
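
For concreteness, that comparison amounts to something like the following (an illustrative sketch; the names and tolerance are assumptions, not the actual dragonfly code):

```python
# Sketch of the central-frequency comparison used to detect a psyllid restart/re-activation.
# Function and argument names are illustrative, not the actual dragonfly code.
def central_frequencies_match(roach_cf, psyllid_cf, tolerance=1.0):
    """Return True if the ROACH and psyllid agree on the central frequency (Hz)."""
    if psyllid_cf is None:
        # psyllid restarted and has not been configured yet
        return False
    return abs(roach_cf - psyllid_cf) <= tolerance
```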

The threshold mismatch check is no longer necessary at the moment, because psyllid now has the same default trigger settings. And I agree that applying the trigger settings via the channel daq interface is redundant now.
But I disagree that (b) is an acceptable option. Most people are probably not aware of how to configure the trigger settings in psyllid correctly, and the channel daq interface takes care of that for the operators.
In addition, if we ever want to take data with a different trigger configuration (and I am sure we will for some reason), then, if we remove this check, how do we catch the case where psyllid restarts and the next run continues with an undesired trigger configuration?

I would even say that maybe we need to do more trigger setting checks. It used to be enough to check one threshold, because psyllid's default threshold was so unrealistic that something had to be wrong if a run was started with that threshold still set. Now, a run-specific trigger configuration could have the same thresholds but different time window settings, for example.
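
One option would be to compare the whole trigger configuration instead of a single threshold. A rough sketch, with key names that are assumptions rather than the actual psyllid option names:

```python
# Illustrative only: compare the full set of trigger-related settings, not just one threshold.
# The key names below are assumptions, not the actual psyllid/dragonfly option names.
def trigger_settings_mismatches(expected, actual,
                                keys=('threshold', 'high_threshold', 'time_window', 'pretrigger_time')):
    """Return a dict of {setting: (expected, actual)} for every setting that disagrees."""
    return {k: (expected.get(k), actual.get(k))
            for k in keys
            if expected.get(k) != actual.get(k)}
```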

I am not sure how we should solve this.

There are four possible restarts that could give rise to different failure modes, and will need different catches:

  • psyllid
    • psyllid restarts with the wrong central frequency, defaulting to 50 MHz
      • the ROACH will never use a CF that low, so this will always flag the psyllid restart
      • psyllid needs the CF because it is written into the egg header, and this makes it an effective restart flag, even though it is stateful
    • psyllid restarts with the correct default trigger settings (since we changed the config file)
      • if we're running a special trigger, the CF check will alert us to a psyllid restart, but we'll need to be careful to properly re-sync the trigger settings in our script
  • roach_daq_run_interface crash
    • what important state information is actually cached here?
    • currently the threshold variable is a catch, but I don't think we care?
  • roach2_interface
    • caches the central frequency, which will be lost in a restart
    • overwrites the fft_shift_vector on restart
    • will we definitely see if this restarts?
  • psyllid_interface
    • is there important state information cached here? probably not

I think the fundamental question is: where do we have cached information that might be important? Does it need to be cached? And if so, how do we catch it when it is lost?

Here's what currently happens when something restarts:

  • psyllid:

    • psyllid comes up activated, so the status check does not detect a restart
    • before every run, the CFs of the ROACH and psyllid are compared; if psyllid restarted, there will be a frequency mismatch.
    • there would also be a trigger threshold mismatch (if the acquisition mode is triggered), but the _do_checks method never gets to it, because the frequency match is tested first (see the sketch after this list).
  • psyllid_interface:

    • if the psyllid interface is restarted, the cached information is lost. That is no longer a problem, because all settings are requested from psyllid before being used or returned.
  • roach_daq_run_interface:

    • the cached information is the trigger threshold. After a restart it is zero, so calling _do_checks after a restart would report a threshold mismatch.
  • roach2_interface:

    • when restarted, the ROACH is re-programmed and the default central frequency is set on all channels. This frequency is 800 MHz, so trying to start a run would give a frequency mismatch.
    • the cached settings are the blocked channels, the FFT shift vectors, and the central frequency.
    • after a restart, all channels are unblocked
    • I checked r2daq and the central frequency is a cached value there too, so there is currently no way to get rid of that. But we could decide we don't want to set a default frequency after the restart; in that case the CF returned by the roach2_interface would be None.
    • if a restart happened during a run, the ROACH would probably stop streaming packets while it is being re-programmed.
  • the roach:

    • the ROACH is pinged before a run is started, so at least it cannot be unresponsive to ping.
    • no other check is done on the ROACH status before a run. The central frequency that the roach_daq_run_interface compares is stored information and is not re-requested before a run starts.
    • _do_checks would not detect that something is wrong, and a run would be started.
    • the ROACH would not be programmed with the bitcode and therefore would not be streaming packets. When psyllid starts a run and is not receiving packets, it crashes (or raises an error; I am not sure which it is, but I definitely remember that something bad happens, which is good in this case).
    • this would also cause a psyllid crash when a new mask is recorded. So as long as the restart doesn't happen between the mask recording and the run, an execution script would crash at the make_mask command.
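
To make the check order above explicit, the pre-run logic has roughly this shape (an illustrative sketch, not the actual _do_checks implementation; the method names on the interface objects are assumptions):

```python
# Rough sketch of the pre-run check order described above: the central-frequency comparison
# comes first, so a psyllid restart is flagged before the threshold check is ever reached.
# Method names on the interface objects are assumptions, not dragonfly's actual API.
def _do_checks(self):
    roach_cf = self.roach2_interface.get_central_frequency(self.channel)     # stored/cached value
    psyllid_cf = self.psyllid_interface.get_central_frequency(self.channel)  # queried from psyllid
    if roach_cf != psyllid_cf:
        raise RuntimeError('central frequency mismatch: roach={}, psyllid={}'.format(roach_cf, psyllid_cf))
    if self.acquisition_mode == 'triggered':
        psyllid_threshold = self.psyllid_interface.get_threshold(self.channel)
        if psyllid_threshold != self.stored_threshold:  # stored_threshold is 0 after an interface restart
            raise RuntimeError('trigger threshold mismatch')
```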

This brings me to these questions/conclusions:

  • I don't see a way to avoid having cached information in the roach2_interface. But at least, if the ROACH or its interface restarts, the data taking is interrupted. I am not aware that these two have ever caused trouble during data taking (at least not spontaneously; the ROACH has of course sometimes refused service for weeks and months in a row).

  • We could just re-set the central frequency before every run... This way, a psyllid restart would not crash the execution script. It sounds dangerous, though, to not notice crashes... (see the sketch after this list)

  • I don't know what to do about the threshold mismatch. I do think trigger settings are something we should check before a run, especially as we sometimes see psyllid restart without anyone noticing. It is annoying to have that check in normal operations, when the settings are also psyllid's defaults. But if we ever take data under different run conditions with a different trigger, not having this check would cause trouble.

  • I advocate that we add a pretrigger time check before the run too. If someone wants to take low-pressure data, leaves the thresholds as they are, but changes the pretrigger time, the threshold check wouldn't tell us that something is wrong. A rough sketch of what such a combined check could look like is below.
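
Putting the last two points together, the extended pre-run preparation could look roughly like this (every method and attribute name here is an assumption for illustration, not dragonfly's actual interface):

```python
# Illustrative sketch combining the ideas above: re-set the central frequency before every run
# and compare the pretrigger time as well as the threshold. Names are assumptions, not the real API.
def prepare_run(self):
    # Re-apply the cached central frequency so a psyllid restart does not kill the execution script
    self.psyllid_interface.set_central_frequency(self.channel, self.central_frequency)

    # Compare every trigger setting we care about, not just the threshold
    expected = {'threshold': self.threshold, 'pretrigger_time': self.pretrigger_time}
    actual = {'threshold': self.psyllid_interface.get_threshold(self.channel),
              'pretrigger_time': self.psyllid_interface.get_pretrigger_time(self.channel)}
    mismatched = {k: (expected[k], actual[k]) for k in expected if expected[k] != actual[k]}
    if mismatched:
        raise RuntimeError('trigger settings do not match the requested run configuration: {}'.format(mismatched))
```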

We should have a note, probably here, about how and why whichever resolution was chosen. The PR is in line with one of my suggestions, but somewhat against your last comment. My recollection of the discussion on the last call was that we should revisit this.