PandABlocks/PandABlocks-client

How to get data frames immediately, instead of one frame late?


Hi Team PandABlocks

At MAX IV, we are trying to implement a near real-time indicator of the number of samples received from the PandABox via the data capture port. At the moment, the count we display underestimates the expected value while capturing.

Symptoms

Simplified code is something like this:

from pandablocks.blocking import BlockingClient
from pandablocks.commands import Arm
from pandablocks.responses import EndData, FrameData, ReadyData

client = BlockingClient(host)
client.connect()

num_samples_acquired = 0
for data in client.data(scaled=True):
    print(f"** data: {data}")
    if isinstance(data, ReadyData):
        client.send(Arm())
    elif isinstance(data, FrameData):
        num_samples_acquired += len(data.data)
        print(f"** samples acquired: {num_samples_acquired}")
    elif isinstance(data, EndData):
        break
print("done")

If we trigger the PandABox PCAP block slowly (on the order of seconds), we see the counter is always 1 behind the number of triggers. This continues until we disarm, when we get the remaining data frames and the correct count. We have flush_every_frame=True.

As an example, consider 2 triggers, each a few seconds apart. (The >> lines are actions we performed via the web interface).

Output we get:

** data: ReadyData()
** data: StartData(...)
>> Trigger PCAP
>> Trigger PCAP
** data: FrameData(...)
** samples acquired: 1
>> Disarm PCAP
** data: FrameData(...)
** samples acquired: 2
** data: EndData(samples=2, reason=<EndReason.DISARMED: 'Disarmed'>)
done

Output we would like:

** data: ReadyData()
** data: StartData(...)
>> Trigger PCAP
** data: FrameData(...)
** samples acquired: 1
>> Trigger PCAP
** data: FrameData(...)
** samples acquired: 2
>> Disarm PCAP
** data: EndData(samples=2, reason=<EndReason.DISARMED: 'Disarmed'>)
done

Analysis

After some digging, this seems to be due to the data-overrun handling added in commit 067c4f4. If I understand it correctly, in _handle_data_frame() the new data frame is added to _pending_data, and the previous data frame is added to _partial_data. The flush() method only looks at _partial_data, so a flush from _handle_data_frame() will always be one frame behind. Finally, in _handle_data_end(), we combine the pending and partial data, which gives us the remaining frames (unless there was a data overrun).
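To illustrate, here is a minimal sketch of that staging behaviour as I understand it (not the actual DataConnection code, just the shape of it):

# Minimal sketch (not the library code) of the staging described above:
# each flush only publishes frames that arrived before the newest one,
# so the published count lags one frame behind until the acquisition ends.
from typing import Iterator, List

class StagedFrames:
    def __init__(self) -> None:
        self._partial_data: List[bytes] = []  # frames that flush() may publish
        self._pending_data = b""              # newest frame, held back

    def handle_data_frame(self, frame: bytes) -> Iterator[bytes]:
        # The previously pending frame becomes publishable, the new one is held
        if self._pending_data:
            self._partial_data.append(self._pending_data)
        self._pending_data = frame
        yield from self.flush()  # flush_every_frame=True: always one behind

    def handle_data_end(self, overrun: bool) -> Iterator[bytes]:
        # At the end the held-back frame is published too, unless an overrun
        # means it may be corrupt
        if self._pending_data and not overrun:
            self._partial_data.append(self._pending_data)
        yield from self.flush()

    def flush(self) -> Iterator[bytes]:
        yield from self._partial_data
        self._partial_data = []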

This implementation makes sense for writing data to a file, where the file will only be accessed after the PCAP block has been disarmed and all data is finalised.

In _handle_data_frame(), the easiest way to get the data frames immediately is to skip the pending-data step, as in the diff below. Obviously, that has the downside of the last frame containing invalid data if there's a data overrun.

--- a/pandablocks/connections.py
+++ b/pandablocks/connections.py
@@ -359,8 +359,7 @@ class DataConnection:
         # we already read "BIN ", so read the rest
         data = self._buf.read_bytes(length - 4)[4:]
         # The pending data is now valid, and what we got is pending
-        self._partial_data += self._pending_data
-        self._pending_data = data
+        self._partial_data += data
         # if told to flush now, then yield what we have
         if self._flush_every_frame:
             yield from self.flush()

In the case of an overrun, I was wondering how we manage to get all the data frame bytes (len("BIN " + 4_bytes_encoded_length + data)) from the data socket, as well as the "END " data, yet with the last frame corrupted. That led me to the PandABlocks-server code. If I understand it correctly, passthrough_capture_block() can send a corrupt data frame, while process_capture_block() avoids sending the corrupt frame. If that is so, then we only need the pending data + discard on the client side when using DataConnection.connect(scaled=False) (the passthrough mode).
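For reference, this is how I read the framing on the data socket; the header layout and endianness are my assumptions from the client code, so treat it as a sketch only:

# Sketch of the data-frame framing as I read it from the client code
# (header layout and endianness are assumptions, not verified against the
# server): each frame is b"BIN " + a 4-byte total length + the payload.
import struct

def read_frame(sock) -> bytes:
    header = sock.recv(8)                        # b"BIN " + 4-byte length
    assert header[:4] == b"BIN "
    (length,) = struct.unpack("<I", header[4:])  # total length incl. header
    # a real reader would loop until all bytes have arrived
    return sock.recv(length - 8)                 # remaining bytes: payload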

Suggestion

Could we skip the pending data + discard step when using DataConnection.connect(scaled=True)? That would suit our use case: our data rate is comparatively low and we are using the scaled processing.

Does this seem like a workable solution to you? If so, we can provide a pull request.
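A rough sketch of what I have in mind (untested; _is_scaled is a hypothetical flag that connect() would record, and the method signature here is illustrative, not the current API):

# Untested sketch; _is_scaled is a hypothetical attribute that
# DataConnection.connect() would record, not part of the current API.
def _handle_data_frame(self, length: int):
    # we already read "BIN ", so read the rest
    data = self._buf.read_bytes(length - 4)[4:]
    if self._is_scaled:
        # Scaled mode: the server never sends a corrupt frame, so publish now
        self._partial_data += data
    else:
        # Passthrough mode: hold the newest frame back until we know
        # there was no overrun
        self._partial_data += self._pending_data
        self._pending_data = data
    # if told to flush now, then yield what we have
    if self._flush_every_frame:
        yield from self.flush()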

Versions

PandABlocks-client version: 0.5.0
PandABlocks-server version: ~2.1.0 (+ PCAP ARM timestamp)

P.S. Your codebase is really easy to read! Well done, and thanks for making such nice code 👍

coretl commented

First of all, thank you for making such a detailed and well thought out issue, it is amazing to have both the context and the analysis in one place!

Your analysis is correct: technically we only need this partial -> pending -> published staging in the scaled=False case. However, I am wondering if it is too complicated in the first place. If we get a data overrun then the scan has failed anyway, so publishing bad data isn't the end of the world. Maybe we should just always publish the data promptly, and document that if we get an EndData with DATA_OVERRUN then the last piece of data may be invalid.

If you're happy with this, then I'd be happy to accept a PR that removes the pending stage universally, and a bit of documentation about EndData and DATA_OVERRUN producing invalid data with scaled=False.
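For the documentation side, a scaled=False consumer could guard against this along the following lines (a sketch only, not tested; the EndReason member name is assumed from the responses module):

# Sketch (untested) of how a scaled=False consumer could guard against an
# overrun once the pending stage is removed: if the acquisition ends with
# DATA_OVERRUN, the last FrameData received may contain invalid samples.
from pandablocks.blocking import BlockingClient
from pandablocks.commands import Arm
from pandablocks.responses import EndData, EndReason, FrameData, ReadyData

client = BlockingClient("panda-hostname")
client.connect()
frames = []
for data in client.data(scaled=False):
    if isinstance(data, ReadyData):
        client.send(Arm())
    elif isinstance(data, FrameData):
        frames.append(data)
    elif isinstance(data, EndData):
        if data.reason == EndReason.DATA_OVERRUN and frames:
            frames.pop()  # discard the possibly-corrupt last frame
        break
client.close()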

OK, @coretl - simplifying it sounds good. I'll open a PR.