Assert failed in `bolt-32host-fattree.cc`
Closed this issue · 6 comments
I ran this experiment with the following parameters:
./waf --run "scratch/bolt-32host-fattree --load=0.8 --traceQueues --traceFlowStats"
But it ran into an assert failure:
Running the Simulation...
assert failed. cond="(targetMsg->GetSrcAddress() == ipv4Header.GetDestination()) && (targetMsg->GetDstAddress() == ipv4Header.GetSource()) && (targetMsg->GetSrcPort() == boltHeader.GetDstPort()) && (targetMsg->GetDstPort() == boltHeader.GetSrcPort())", +1.019101584s -1 file=../src/internet/model/bolt-l4-protocol.cc, line=1343
terminate called without an active exception
I used a fork to make the code run under G++ 13. Could you give it a try? Thanks.
UPD: The same issue occurs when running `bolt-64host-fattree.cc`.
UPD2: The same issue occurs in every experiment that needs a workload. Note that I used the workload file from the Homa repo.
Hi,
Thank you for showing interest in our work.
The assertion failure you get is thrown whenever a control packet received on an end-host doesn't belong to that end-host. In other words, the source or destination port numbers on the incoming control packet don't match what the end-host expects. This is certainly not an error we encountered before. I suspect the new compiler version is affecting how port numbers are assigned to new Bolt sockets on the end-hosts.
Unfortunately I am not able to rerun these experiments these days. I recommend rerunning them while checking the socket port numbers on each end-host. You will then probably need to readjust `portNoStart` at this link to match them.
Please let me know if you get any new insights.
Hi and thanks for your reply.
I'm digging into this issue. So far I have found:
- Readjusting `portNoStart` to 500 or 10,000 does not help.
- I added some lines to make the assert report in a fine-grained way, and it shows that the failing condition is the comparison of the `targetMsg` destination with the `ipv4Header` source (targetMsg: 10.0.20.1:1020 vs ipv4Header: 10.0.30.1:1030).
- It fails on g++-13, g++-11, g++-8, and clang++-11.

Note that the port numbers match the IP addresses, so the issue is not about port number assignment.
At this line, all incoming packets are verified to have the correct destination address and port. So I am guessing that the way outbound messages are stored here is buggy and doesn't maintain the right information for every message. Would you be able to inspect that?
Thanks. I will inspect that.
I noticed that `txMsgId` is a `uint16_t` value. Could it be too small for the message ID field? I will add a sanitizer to check this.
UPD: It cannot compile after adding `-fsanitize=undefined`; maybe the ns-3 version is too old...
UPD2: Yes. I can't say it is a true hash collision, but `uint16_t` is too small: the id wraps around and collides here. Tested by adding a collision check at line 1291:
if (m_outboundMsgs.find (txMsgId) != m_outboundMsgs.end ())
  NS_ASSERT_MSG (false, "hash collision!");
And the result is:
assert failed. cond="false", msg="hash collision!", +1.019101584s -1 file=../src/internet/model/bolt-l4-protocol.cc, line=1293
terminate called without an active exception
I'm going to increase `uint16_t` to `uint64_t` to see whether it will be OK.
I noticed that `txMsgId` should be an `int` value according to here (and a lot of code likewise). So I think it's better to increase `uint16_t` to `uint32_t`. Maybe `rxMsgId` should get the same treatment.
UPD3: Using `uint32_t` asserts even more quickly than `uint16_t`, and the messages collide at `txMsgId` 0. Weird.
The two colliding flows look like:
A->B txMsgId=0
...
B->C txMsgId=17
and flow B->C receives flow A->B's Bolt header. The packet header looks like:
tos 0x0 DSCP Default ECN Not-ECT ttl 0 id 0 protocol 196 offset (bytes) 0 flags [DF] length: 43 10.0.9.1 > 10.0.29.1
length: 23 1009 > 1029 txMsgId: 0 seqAckNo: 0 hopCnt: 0 reflectedDelay: 15258 drainTime: 831 BTS|FIN|LAST|FIRST|DECWIN|AI|LINK100G
The packet seems to be an SRC packet.
UPD4: The `txMsgId` of the crashed packet is 0, so maybe it's a corrupt packet? I'm not sure, but "the packet is corrupt" seems possible, as if we should set `txMsgId` somewhere but didn't. Nope: I changed `load` to 0.5, and the crashed `txMsgId` is 12. So the conclusion I have is that `txMsgId` should be upgraded to `uint32_t`, but that needs a lot of scrupulous work. Also, I have no idea about the issue UPD3 mentioned; it's quite weird and should be noted.
UPD5: I added UBSan and it reports two kinds of errors:
../src/traffic-control/model/pfifo-bolt-queue-disc.cc:128:16: runtime error: signed integer overflow: -2147483648 + -593 cannot be represented in type 'int'
../src/traffic-control/model/pfifo-bolt-queue-disc.cc: runtime error: member call on misaligned address 0x55c1031b33f6 for type 'struct BoltHeaderPtr', which requires 4 byte alignment
0x55c1031b33f6: note: pointer points here
...
For the first one, some value is truly corrupted; for example, `m_availLoad` becomes negative. For the second one, unsafe pointer casts are used everywhere in `pfifo-bolt-queue-disc.cc`. I'm not sure whether this is the cause of the issue, but the line here is a very dangerous operation: it ignores the `struct`'s memory alignment and can cause data corruption.
Hi @serhatarslan-hub. After the following patches, I'm able to run the experiment:
- Avoid `reinterpret_cast` usage
- Upgrade the `uint16_t` tx/rxMsgId to `uint32_t`

And `bolt-32host-fattree` logs:
Running the Simulation...
Total utilization: 2518.45Gbps
99%ile queue size: 4.032usec (50400 Bytes)
Time taken by simulation: 22.2 minutes
Though the `Total utilization` figure is weird (maybe it should be divided by `nhost`), getting it running again is good news. Cheers!
Please refer to this commit. Thanks for your help!