[Bug] Hang when using ProxyChan and one GPU is sending zero bytes.
FC-Li opened this issue · 3 comments
FC-Li commented
The GPU that is sending zero bytes pushes a ProxyTrigger whose fst value is zero.
The proxy's loop in proxy.cc skips this trigger instead of handling it:
    trigger = fifo.poll();
    if (trigger.fst == 0 || trigger.snd == 0) {  // TODO: this check is a potential pitfall for custom triggers
      continue;  // there is one in progress
    }
    trigger.snd ^= ((uint64_t)1 << (uint64_t)63);  // this is where the last bit of snd is reverted.
    ProxyHandlerResult result = handler(trigger);  // SKIPPED!!!!!!!!!!!!!!!!!!!
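To make the failure mode concrete, here is a small host-only sketch of the check quoted above. It is not MSCCL++ code: `ModelTrigger`, `makeTrigger`, `treatedAsEmpty`, and `reachesHandler` are hypothetical names, and the packing of size and source offset into `fst` is an assumption made only for illustration. The point is that a zero-byte trigger can end up with `fst == 0`, which the loop cannot tell apart from an empty FIFO slot.

```cpp
#include <cstdint>

// Hypothetical stand-in for ProxyTrigger: two 64-bit words, named as in the snippet above.
struct ModelTrigger {
  uint64_t fst;
  uint64_t snd;
};

// ASSUMPTION: size lives in the low 32 bits of fst and the source offset in the high 32 bits.
// The real bit layout may differ; only "size 0 and offset 0 gives fst == 0" matters here.
inline ModelTrigger makeTrigger(uint32_t size, uint32_t srcOffset, uint64_t snd) {
  return ModelTrigger{static_cast<uint64_t>(size) | (static_cast<uint64_t>(srcOffset) << 32), snd};
}

// Same predicate as the check quoted from proxy.cc.
inline bool treatedAsEmpty(const ModelTrigger& t) { return t.fst == 0 || t.snd == 0; }

// True when the trigger would get past the check and reach handler(trigger).
inline bool reachesHandler(const ModelTrigger& t) { return !treatedAsEmpty(t); }
```

Under these assumptions, `reachesHandler(makeTrigger(0, 0, 1))` is false while `reachesHandler(makeTrigger(4, 0, 1))` is true, so only the zero-byte sender loses its trigger.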
When the trigger is skipped, this GPU's counterpart is never signaled, because Host2DeviceSemaphore::signal -> IBConnection::updateAndSync is never called. The counterpart then hangs in Host2DeviceSemaphoreDeviceHandle's wait:
    MSCCLPP_DEVICE_INLINE void wait(int64_t maxSpinCount = 100000000) {
      (*expectedInboundSemaphoreId) += 1;
      POLL_MAYBE_JAILBREAK((atomicLoad(inboundSemaphoreId, memoryOrderAcquire) < (*expectedInboundSemaphoreId)),
                           maxSpinCount);
    }
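The waiting side can be modeled on the host in the same spirit. This is only a sketch using `std::atomic`, not the device code: `ModelSemaphore` and its members are hypothetical names, and returning `false` after `maxSpinCount` polls is my simplification of whatever `POLL_MAYBE_JAILBREAK` does when the spin budget runs out. It shows that `wait` raises the expected id first, so without a matching `signal` the condition can never become true.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical host-side model of the semaphore pair used by wait() above.
struct ModelSemaphore {
  std::atomic<uint64_t> inbound{0};  // plays the role of inboundSemaphoreId
  uint64_t expected = 0;             // plays the role of expectedInboundSemaphoreId

  // What the proxy would do on the counterpart's behalf after handling a trigger.
  void signal() { inbound.fetch_add(1, std::memory_order_release); }

  // Returns true if the expected id was observed within maxSpinCount polls.
  // ASSUMPTION: giving up and returning false is a simplification for this model.
  bool wait(int64_t maxSpinCount) {
    expected += 1;
    for (int64_t spin = 0; spin < maxSpinCount; ++spin) {
      if (inbound.load(std::memory_order_acquire) >= expected) return true;
    }
    return false;
  }
};
```

In this model, calling `wait(1000)` on a fresh `ModelSemaphore` returns false, while calling `signal()` first makes the same `wait` return true. That matches the hang described above: the skipped trigger means the signal never arrives.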
FC-Li commented
@Binyang2014 Is this a known issue?