microsoft/mscclpp

[Bug] Hang when using ProxyChan and one GPU is sending zero bytes.

FC-Li opened this issue · 3 comments

FC-Li commented

The GPU that is sending zero bytes pushes a ProxyTrigger whose fst is zero.

The proxy's loop in proxy.cc skips handling this trigger:

      trigger = fifo.poll();
      if (trigger.fst == 0 || trigger.snd == 0) {  // TODO: this check is a potential pitfall for custom triggers
        continue;                                  // there is one in progress
      }
      trigger.snd ^= ((uint64_t)1 << (uint64_t)63);  // this is where the last bit of snd is reverted.

      ProxyHandlerResult result = handler(trigger);  // SKIPPED!!!!!!!!!!!!!!!!!!!

When the skip happens, this GPU's counterpart is never signaled, because Host2DeviceSemaphore::signal -> IBConnection::updateAndSync is never called. The counterpart then hangs in Host2DeviceSemaphoreDeviceHandle's wait:

  MSCCLPP_DEVICE_INLINE void wait(int64_t maxSpinCount = 100000000) {
    (*expectedInboundSemaphoreId) += 1;
    POLL_MAYBE_JAILBREAK((atomicLoad(inboundSemaphoreId, memoryOrderAcquire) < (*expectedInboundSemaphoreId)),
                         maxSpinCount);
  }
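
The chain described above (zero fst, handler not reached, peer never signaled, wait never satisfied) can be sketched as a small host-only model. This is not the mscclpp code: `Trigger`, `PeerSemaphore`, `reachesHandler`, and `peerWaitCompletes` are hypothetical names, the fst value used for a non-empty put is arbitrary rather than the real bit encoding, and the wait is bounded so the model terminates where the real device-side wait would keep spinning.

```cpp
#include <cstdint>

// Hypothetical stand-in for ProxyTrigger: two 64-bit words.
struct Trigger {
  uint64_t fst;
  uint64_t snd;
};

// Mirrors the condition in the quoted proxy loop: a trigger with a zero
// word is treated like an empty slot and does not reach the handler.
inline bool reachesHandler(const Trigger& t) {
  return !(t.fst == 0 || t.snd == 0);
}

// Hypothetical model of the receiving side's semaphore state.
struct PeerSemaphore {
  uint64_t inbound = 0;   // advanced only when the sender's proxy signals
  uint64_t expected = 0;  // advanced by every wait()

  // Bounded stand-in for the device-side wait. Returns false in the case
  // where the real wait would not make progress.
  bool wait(int maxSpin = 1000) {
    expected += 1;
    for (int i = 0; i < maxSpin; ++i) {
      if (inbound >= expected) return true;
    }
    return false;
  }
};

// Pushes one put-with-signal trigger through the model and reports
// whether the peer's wait completes.
inline bool peerWaitCompletes(const Trigger& t) {
  PeerSemaphore sem;
  if (reachesHandler(t)) {
    sem.inbound += 1;  // the handler would signal the peer here
  }
  return sem.wait();
}
```

In this model, `peerWaitCompletes({16, 1})` completes, while `peerWaitCompletes({0, 1})` does not, which corresponds to the hang reported here.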

FC-Li commented

@Binyang2014 Is this a known issue?

@FC-Li This is expected, as we don't define the behavior of put-ing zero bytes (which makes fst == 0). We may need to handle this in a better way, but do you have any use cases for put-ing zero bytes, which is a no-op by definition?

@chhwang
It's just a corner case.

I solved it by signaling without a put when there is nothing to send:

if (send_bytes > 0) {
    proxyChan.putWithSignal(....);
} else {
    proxyChan.signal(...);
}
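
The same branch can be wrapped in a small helper so that every call site gets the zero-byte guard. The sketch below is host-only and uses a mock: `MockChan` and `sendOrSignal` are hypothetical names, and the argument lists are assumptions, since the snippet above elides them. It only illustrates the dispatch, not the real channel API.

```cpp
#include <cstdint>

// Hypothetical mock that records which call was made. It is not the
// mscclpp channel type, and its signatures are assumptions.
struct MockChan {
  int puts = 0;
  int signals = 0;
  void putWithSignal(uint64_t /*dstOffset*/, uint64_t /*srcOffset*/, uint64_t /*bytes*/) { ++puts; }
  void signal() { ++signals; }
};

// Sends data plus a signal when there is something to send, and only a
// signal otherwise, so the peer's wait is always matched by one signal.
template <typename Chan>
void sendOrSignal(Chan& chan, uint64_t dstOffset, uint64_t srcOffset, uint64_t bytes) {
  if (bytes > 0) {
    chan.putWithSignal(dstOffset, srcOffset, bytes);
  } else {
    chan.signal();
  }
}
```

Either branch produces exactly one signal for the peer, so the number of waits on the receiving side does not need to depend on the payload size.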