serhatarslan-hub/bolt_cc_ns3

Assert failed in `bolt-32host-fattree.cc`

Closed this issue · 6 comments

I ran this experiment with the following parameters:

./waf --run "scratch/bolt-32host-fattree --load=0.8 --traceQueues --traceFlowStats"

But it ran into an assertion failure:

Running the Simulation...
assert failed. cond="(targetMsg->GetSrcAddress() == ipv4Header.GetDestination()) && (targetMsg->GetDstAddress() == ipv4Header.GetSource()) && (targetMsg->GetSrcPort() == boltHeader.GetDstPort()) && (targetMsg->GetDstPort() == boltHeader.GetSrcPort())", +1.019101584s -1 file=../src/internet/model/bolt-l4-protocol.cc, line=1343
terminate called without an active exception

I am using a fork that makes the code compile under g++ 13. Could you give it a try? Thanks.

UPD: The same issue occurs when running bolt-64host-fattree.cc.

UPD2: The same issue occurs in every experiment that needs a workload. Note that I used the workload file from the Homa repo.

Hi,

Thank you for showing interest in our work.

The assertion failure you get is thrown whenever a control packet received at an end-host doesn't belong to that end-host; in other words, the source or destination port numbers on the incoming control packet differ from what the end-host expects. This is certainly not an error we had seen before. I suspect the new compiler version affects how port numbers are assigned to new Bolt sockets at the end-hosts.

Unfortunately I am not able to rerun these experiments these days. I recommend rerunning them while checking the socket port numbers on each end-host. You will then probably need to readjust portNoStart at this link to match them.

Please let me know if you get any new insights.

Hi and thanks for your reply.

I'm digging into this issue. So far I have found that:

  1. Readjusting portNoStart to 500 or 10,000 does not help.
  2. I added some lines to split the assert into fine-grained checks. They show that the failing condition is the comparison of the targetMsg destination against the ipv4Header source (targetMsg: 10.0.20.1:1020 vs ipv4Header: 10.0.30.1:1030).
  3. The assert fails with g++-13, g++-11, g++-8, and clang++-11.

Note that the port numbers match the IP addresses, so the issue is not about port number assignment.
At this line, all incoming packets are verified to have the correct destination address and port. So I am guessing that the way outbound messages are stored here is buggy and doesn't maintain the right information for every message. Would you be able to inspect that?

Thanks. I will inspect that.

I noticed that the txMsgId is a uint16_t value. Will it be too small for the message ID field? I will add a sanitizer to check this.

UPD: The project fails to compile after adding -fsanitize=undefined; maybe the ns-3 version is too old...

UPD2: Yes, uint16_t is too small and a collision happens here. Strictly speaking I can't say whether it is a hash collision, but the uint16_t value does wrap around. I tested this by adding a collision check at line 1291:

if (m_outboundMsgs.find(txMsgId) != m_outboundMsgs.end())
    NS_ASSERT_MSG(false, "hash collision!");

And the result is:

assert failed. cond="false", msg="hash collision!", +1.019101584s -1 file=../src/internet/model/bolt-l4-protocol.cc, line=1293
terminate called without an active exception

I'm going to widen txMsgId from uint16_t to uint64_t to see whether that fixes it.

I noticed that txMsgId is expected to be an int value according to here (and a lot of other code likewise), so I think widening uint16_t to uint32_t is enough. rxMsgId should probably be widened as well.

UPD3: With uint32_t the assert fires even sooner than with uint16_t, and the colliding message has txMsgId 0. Weird.

The two colliding flows look like:

A->B txMsgId=0
...
B->C txMsgId=17

and flow B->C receives flow A->B's Bolt header. The packet header looks like:

tos 0x0 DSCP Default ECN Not-ECT ttl 0 id 0 protocol 196 offset (bytes) 0 flags [DF] length: 43 10.0.9.1 > 10.0.29.1
length: 23 1009 > 1029 txMsgId: 0 seqAckNo: 0 hopCnt: 0 reflectedDelay: 15258 drainTime: 831 BTS|FIN|LAST|FIRST|DECWIN|AI|LINK100G

The packet looks like an SRC packet.

UPD4: The txMsgId of the crashing packet is 0, so at first I suspected a corrupt packet, e.g. a txMsgId we should set somewhere but don't. Nope: after changing the load to 0.5, the crashing txMsgId is 12. So my conclusion is that txMsgId should be widened to uint32_t, though that requires a lot of careful work. I still have no explanation for the behavior mentioned in UPD3; it's quite weird and worth keeping in mind.

UPD5: I enabled UBSan and it reports two kinds of errors:

../src/traffic-control/model/pfifo-bolt-queue-disc.cc:128:16: runtime error: signed integer overflow: -2147483648 + -593 cannot be represented in type 'int'
../src/traffic-control/model/pfifo-bolt-queue-disc.cc: runtime error: member call on misaligned address 0x55c1031b33f6 for type 'struct BoltHeaderPtr', which requires 4 byte alignment
0x55c1031b33f6: note: pointer points here
...

For the first one, some value really is corrupted, e.g. m_availLoad becomes negative. For the second one, unsafe pointer casts are used all over pfifo-bolt-queue-disc.cc. I'm not sure whether this is the cause of the original issue, but the line here is a very dangerous operation: it ignores the struct's alignment requirements and can corrupt data.

Hi @serhatarslan-hub . After the following patches, I'm able to run the experiment:

  1. Avoid reinterpret_cast usage
  2. Widen the uint16_t tx/rxMsgId fields to uint32_t

And bolt-32host-fattree logs:

Running the Simulation...
Total utilization: 2518.45Gbps
99%ile queue size: 4.032usec (50400 Bytes)
Time taken by simulation: 22.2 minutes

Though the total utilization looks weird (maybe it should be divided by the number of hosts), getting it running again is good news. Cheers!

Please refer to this commit. Thanks for your help!