gossip panic: switch on corrupt value - SocketAddr from ContactInfo during buildPullRequests
Closed this issue · 0 comments
dnut commented
Description
When running gossip on mainnet, it eventually panics due to switching on a corrupt SocketAddr within a ContactInfo. It may take a few minutes for the error to occur, but it always happens eventually.
thread 9468759 panic: switch on corrupt value
/Users/drew/mine/code/sig/src/net/net.zig:135:17: 0x100ecc3ef in eql (sig)
                switch (self.*) {
                ^
/Users/drew/mine/code/sig/src/gossip/data.zig:1001:32: 0x100ea013b in getSocket (sig)
        if (self.cache[key].eql(&SocketAddr.UNSPECIFIED)) {
                               ^
/Users/drew/mine/code/sig/src/gossip/service.zig:1701:60: 0x100f8b1db in getGossipNodes__anon_28001 (sig)
            const peer_gossip_addr = contact_info.getSocket(SOCKET_TAG_GOSSIP);
                                                           ^
/Users/drew/mine/code/sig/src/gossip/service.zig:924:44: 0x100f8be33 in buildPullRequests (sig)
        var peers = try self.getGossipNodes(
                                           ^
/Users/drew/mine/code/sig/src/gossip/service.zig:702:53: 0x100f9280b in buildMessages (sig)
                var packets = self.buildPullRequests(
                                                    ^
/opt/homebrew/Cellar/zig/0.11.0/lib/zig/std/Thread.zig:433:13: 0x100f519f7 in callFn__anon_25266 (sig)
            @call(.auto, f, args) catch |err| {
            ^
/opt/homebrew/Cellar/zig/0.11.0/lib/zig/std/Thread.zig:685:30: 0x100f2d93b in entryFn (sig)
                return callFn(f, args_ptr.*);
                             ^
???:?:?: 0x18c546033 in ??? (libsystem_pthread.dylib)
???:?:?: 0xb26000018c540e3b in ??? (???)
How to Reproduce the Bug
git checkout 69b9a8e871698371902cb1b60a1d3f046a502c4d # current main
zig build run -- -l info gossip \
--entrypoint 34.83.231.102:8001 \
--entrypoint 145.40.67.83:8001 \
--entrypoint 147.75.38.117:8001 \
--entrypoint 145.40.93.177:8001 \
--entrypoint 86.109.15.59:8001
Additional Context
I tried commenting out bincode.free in GossipTable.remove, but it didn't help. My hypothesis was that the unsafe free was causing undefined behavior when memory was accessed after being freed.
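
For illustration, here is a minimal sketch of how this panic message can arise from overwritten memory, which is the mechanism the use-after-free hypothesis relies on. In Debug builds Zig fills `undefined` memory with 0xaa bytes, and `Allocator.free` sets freed memory to `undefined` before releasing it, so a tagged union read after its memory was freed can carry a tag that matches no variant. `Addr` below is a hypothetical stand-in, not sig's actual SocketAddr, and the explicit `@memset` only simulates the overwrite rather than performing a real use-after-free. It is written against Zig 0.11 and is not a claim about the root cause of this issue.

```zig
const std = @import("std");

// Hypothetical stand-in for a tagged union like SocketAddr.
// Only tag values 0 and 1 are valid.
const Addr = union(enum(u8)) {
    v4: u32,
    v6: u64,
};

pub fn main() void {
    var addr: Addr = .{ .v4 = 0x7f000001 };

    // Simulate the memory being clobbered: overwrite every byte of the
    // union, including its tag, with 0xaa (the Debug-mode `undefined` pattern).
    @memset(std.mem.asBytes(&addr), 0xaa);

    // In Debug or ReleaseSafe builds this exhaustive switch is expected to
    // trip the "switch on corrupt value" safety check, because 0xaa is not
    // a valid tag for Addr.
    switch (addr) {
        .v4 => std.debug.print("v4\n", .{}),
        .v6 => std.debug.print("v6\n", .{}),
    }
}
```

If the corrupt SocketAddr in the trace above turns out to consist of 0xaa bytes, that would point toward memory that was freed or never initialized. Any other byte pattern would suggest a different source of corruption, such as a data race or an out-of-bounds write.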