m-labs/artiq

Increase in synchronous RPC times from ~3ms (ARTIQ v7) to ~40ms (ARTIQ v8)

clayton-ho opened this issue · 7 comments

Bug Report

One-Line Summary

Updated our system (i.e. flashed the kasli firmware/gateware and upgraded the host software) from ARTIQ v7 to ARTIQ v8 and found that synchronous RPC times increased from a very consistent ~3ms on ARTIQ v7 to a highly variable ~45ms on ARTIQ v8.

Issue Details

Steps to Reproduce

Noticed it while running experiment code, but it can be reproduced by profiling synchronous RPC times, e.g. using the following code snippet.

@kernel
def run(self):
    self.core.reset()

    for i in range(self.repetitions):
        self.core.break_realtime()

        # simulate pulse sequence
        self.urukul2_ch1.sw.on()
        delay_mu(self.time_delay_mu)
        self.urukul2_ch1.sw.off()

        # RPC call
        time_start_mu = self.core.get_rtio_counter_mu()
        self.rpc_call()
        time_stop_mu = self.core.get_rtio_counter_mu()
        self.update_results(time_stop_mu - time_start_mu)

@rpc
def rpc_call(self) -> TNone:
    return
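
For reference, a minimal sketch of the pieces the snippet assumes (device setup and the result-recording RPC). The class name, attribute values, and update_results body below are placeholders for illustration, not the reporter's actual experiment:

from artiq.experiment import *

class RPCProfiler(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("urukul2_ch1")
        self.repetitions = 100                               # number of timed RPC calls (placeholder)
        self.time_delay_mu = self.core.seconds_to_mu(10*us)  # placeholder pulse length in machine units
        self.results = []

    @rpc
    def update_results(self, duration_mu: TInt64) -> TNone:
        # store the RTIO counter difference for offline analysis
        self.results.append(duration_mu)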

Expected Behavior

Ideally, the delays would be fairly close to ~3ms, or at least <10ms, since a typical pulse sequence is ~15ms for us.

Actual (undesired) Behavior

Profiled the hardware time overheads on both our ARTIQ v7 and ARTIQ v8 systems:

  • Note: during actual sequences, the synchronous RPC overhead with v8 seems a lot more variable than shown here, typically anywhere from ~3ms to ~50ms
[image: RPC timing measurements, ARTIQ v7 vs ARTIQ v8]

Tried a few things to localize the problem:

  • As expected, adding an async flag to the RPC call produces no timing overhead at all (see the sketch after this list)
  • Even an empty synchronous RPC call (i.e. immediately returns) incurs the extra overhead
  • Tested both v7 and v8 firmware/gateware against both v7 and v8 host software
    • extra RPC overhead only happens with v8 firmware/gateware, and is independent of host software
    • figures shown above use release-8, commit 4147870
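
The async variant referred to in the first bullet was declared with ARTIQ's standard flag; a minimal sketch (the method name here is mine, the body mirrors rpc_call above):

@rpc(flags={"async"})
def rpc_call_async(self) -> TNone:
    # fire-and-forget: the kernel does not wait for a reply,
    # so the call adds no round-trip latency
    return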

Your System (omit irrelevant parts)

  • Operating System: Windows 10
  • ARTIQ v7 firmware/gateware: 7.8208.38c72fd
  • ARTIQ v8 firmware/gateware: 8.8962+4147870
  • Host software: ARTIQ v8.0+unknown.beta
  • Hardware involved:
    • ucla2 variant (from m-labs)
    • kasli v2.0

It might be worth looking at a Wireshark capture to see if there is anything strange happening/to get another layer of timestamps to correlate. I wonder whether this is related to the network stack (e.g. Nagle/delayed ACK)…
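
Not part of the original suggestion, just a rough sketch of one way to pull another layer of timestamps out of such a capture and correlate them with the kernel-side measurements. Assumptions (not from the issue): scapy is installed, the capture file is named rpc_profiling.pcapng, and the core device main connection is on its default TCP port 1381.

from scapy.all import rdpcap, TCP

# Load the capture and keep only segments on the core device main connection.
pkts = rdpcap("rpc_profiling.pcapng")
core = [p for p in pkts if TCP in p and 1381 in (p[TCP].sport, p[TCP].dport)]

# Print gaps between consecutive segments; stalls of tens of ms between small
# segments would point at a Nagle/delayed-ACK interaction rather than
# host-side processing time.
prev = None
for p in core:
    if prev is not None:
        gap_ms = (float(p.time) - float(prev.time)) * 1e3
        if gap_ms > 10.0:
            print(f"gap of {gap_ms:.1f} ms before packet at t={float(p.time):.6f}")
    prev = p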

@dnadlinger sorry for the late response, our experiment has been running so I haven't had time to investigate further; I'll try to look into it this week.

In the meantime, I had a friend try to replicate the issue on their v8 Kasli-SoC system (previous figures were for kasli v2.0); results are attached.

Hardware involved:

  • ucla3 variant (from m-labs)
  • kasli-SoC (firmware/gateware: 8.8892+86a8d2d)
[image: Kasli-SoC RPC timing measurements]

Attached is a wireshark packet capture taken during an RPC profiling experiment (experiment file included), as well as the HDF result file (all zipped up).

rpc_profiling_2024_11_09.zip

Your System (omit irrelevant parts)

  • Operating System: Windows 10
  • ARTIQ firmware/gateware: 8.8962+4147870
  • Host software: ARTIQ v8.0+unknown.beta

Hardware involved:

  • ucla2 variant (from m-labs)
  • kasli v2.0

Seems to be related to these changes in release-8...release-7
[image: the release-8...release-7 changes that were reverted]
After applying these reverts, the RPC latency went back to normal.

I'll investigate deeper to find out which change exactly caused this

Follow-up: commit c60de48 seems to cause this issue. After reapplying it on top of the above-mentioned reverts, the RPC latency increased by around 40ms again (i.e. the regression came in with the smoltcp v0.8 upgrade and related changes).
@mbirtwell @hartytp

I think @dnadlinger called it. The Nagle algorithm was added to smoltcp between 0.6 and 0.8; I've disabled it on all sockets. I think this will only have affected RPC calls made while there was an unacknowledged async RPC. You can see this effect in the data @clayton-ho posted: the first RPC you timed (before doing any async RPCs) was fast, and all the others were slow.
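
For anyone following along: the actual fix lives in the device firmware (the smoltcp sockets), not on the host. Purely as an illustration of the knob involved, this is what disabling Nagle looks like on an ordinary host-side Python socket; the address and port below are placeholders and this is not part of the ARTIQ change:

import socket

# Illustration only: open a TCP connection and turn off Nagle's algorithm,
# so small writes are sent immediately instead of waiting for an ACK.
sock = socket.create_connection(("192.168.1.75", 1381))  # placeholder host/port
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)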

Implemented this fix on our system and RPC times are back to ~1ms!

thanks for all the help @dnadlinger @thomasfire @mbirtwell