Syndica/sig

gossip panic: switch on corrupt value - SocketAddr from ContactInfo during buildPullRequests

Closed this issue · 0 comments

dnut commented

Description

When running gossip on mainnet, the process eventually panics while switching on a corrupt SocketAddr inside a ContactInfo. The panic may take a few minutes to appear, but it always happens eventually.

thread 9468759 panic: switch on corrupt value
/Users/drew/mine/code/sig/src/net/net.zig:135:17: 0x100ecc3ef in eql (sig)
        switch (self.*) {
                ^
/Users/drew/mine/code/sig/src/gossip/data.zig:1001:32: 0x100ea013b in getSocket (sig)
        if (self.cache[key].eql(&SocketAddr.UNSPECIFIED)) {
                               ^
/Users/drew/mine/code/sig/src/gossip/service.zig:1701:60: 0x100f8b1db in getGossipNodes__anon_28001 (sig)
            const peer_gossip_addr = contact_info.getSocket(SOCKET_TAG_GOSSIP);
                                                           ^
/Users/drew/mine/code/sig/src/gossip/service.zig:924:44: 0x100f8be33 in buildPullRequests (sig)
        var peers = try self.getGossipNodes(
                                           ^
/Users/drew/mine/code/sig/src/gossip/service.zig:702:53: 0x100f9280b in buildMessages (sig)
                var packets = self.buildPullRequests(
                                                    ^
/opt/homebrew/Cellar/zig/0.11.0/lib/zig/std/Thread.zig:433:13: 0x100f519f7 in callFn__anon_25266 (sig)
            @call(.auto, f, args) catch |err| {
            ^
/opt/homebrew/Cellar/zig/0.11.0/lib/zig/std/Thread.zig:685:30: 0x100f2d93b in entryFn (sig)
                return callFn(f, args_ptr.*);
                             ^
???:?:?: 0x18c546033 in ??? (libsystem_pthread.dylib)
???:?:?: 0xb26000018c540e3b in ??? (???)

How to Reproduce the Bug

git checkout 69b9a8e871698371902cb1b60a1d3f046a502c4d  # current main
zig build run -- -l info gossip \
    --entrypoint 34.83.231.102:8001 \
    --entrypoint 145.40.67.83:8001 \
    --entrypoint 147.75.38.117:8001 \
    --entrypoint 145.40.93.177:8001 \
    --entrypoint 86.109.15.59:8001

Additional Context

My hypothesis was that the unsafe free was causing undefined behavior when memory was accessed after being freed. To test this, I tried commenting out bincode.free in GossipTable.remove, but it didn't help.
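
For context on why a use-after-free is a plausible suspect: in Debug builds, std.mem.Allocator.free overwrites the released bytes with undefined, which Zig fills with 0xaa. If a tagged union is later read from such memory, its tag is not a valid variant, and the safety check on switch should fail with the same message as in the trace above. The following is a minimal sketch of that mechanism only. Addr and isV4 are hypothetical stand-ins, not sig's SocketAddr or its eql, and the corruption is simulated with @memset rather than a real free.

```zig
const std = @import("std");

// Hypothetical stand-in for a tagged union such as SocketAddr; not sig's type.
const Addr = union(enum) {
    v4: [4]u8,
    v6: [16]u8,
};

// Mirrors the shape of a method that switches on the union through a pointer.
fn isV4(addr: *const Addr) bool {
    // In safe build modes the tag is validated before dispatching.
    return switch (addr.*) {
        .v4 => true,
        .v6 => false,
    };
}

pub fn main() void {
    var addr = Addr{ .v4 = .{ 127, 0, 0, 1 } };
    std.debug.print("valid union: isV4 = {}\n", .{isV4(&addr)});

    // Simulate what freed memory looks like in a Debug build:
    // every byte, including the tag, becomes 0xaa.
    @memset(std.mem.asBytes(&addr), 0xaa);

    // The tag is no longer a valid variant, so a Debug or ReleaseSafe build
    // is expected to panic here instead of returning.
    _ = isV4(&addr);
}
```

If the real panic has this origin, the corrupt bytes would have to come from some write or free other than the bincode.free call ruled out above, for example a ContactInfo that is still referenced after its backing memory was released or reused elsewhere.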