zeek/spicy

Writing overlapping data to sink segfaults (was: 7.0.0-rc1 Segmentation Fault)

Closed this issue · 5 comments

We've been experiencing infrequent crashes (maybe once per week) with a segmentation fault error.

Here's the crash report:

Zeek 7.0.0-rc1-debug
Linux 5.15.0-113-generic

Zeek plugins: (none found)

==== No reporter.log

==== stderr.log
warning in /srv/zeek/share/zeek/policy/tuning/defaults/__load__.zeek, line 1: deprecated script loaded from /srv/zeek/spool/installed-scripts-do-not-touch/site/local.zeek:13 "Remove in v7.1 The policy/tuning/defaults package is deprecated. The options set here are now the defaults for Zeek in general.";
listening on eno1

/srv/zeek/share/zeekctl/scripts/run-zeek: line 110: 613375 Segmentation fault      (core dumped) nohup ${pin_command} $pin_cpu "$myzeek" "$@"

==== stdout.log
max memory size             (kbytes, -m) unlimited
data seg size               (kbytes, -d) unlimited
virtual memory              (kbytes, -v) unlimited
core file size              (blocks, -c) unlimited

==== .cmdline
-i af_packet::eno1 -U .status -p zeekctl -p zeekctl-live -p local -p worker-barfleur-24 local.zeek zeekctl base/frameworks/cluster zeekctl/auto

==== .env_vars
PATH=/srv/zeek/bin:/srv/zeek/share/zeekctl/scripts:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
ZEEKPATH=/srv/zeek/spool/installed-scripts-do-not-touch/site::/srv/zeek/spool/installed-scripts-do-not-touch/auto:/srv/zeek/share/zeek:/srv/zeek/share/zeek/policy:/srv/zeek/share/zeek/site:/srv/zeek/share/zeek/builtin-plugins
CLUSTER_NODE=worker-barfleur-24

==== .status
RUNNING [run_loop]

==== No prof.log

==== No packet_filter.log

==== No loaded_scripts.log

And here's the core dump backtrace:

#0  0x0000000000000000 in ?? ()
zeek/zeek#1  0x00007ff3433a0301 in spicy::rt::Sink::_reportOverlap (this=0x7ff3287b6690, rseq=51, old=..., new_=...) at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/spicy/runtime/src/sink.cc:302
zeek/zeek#2  0x00007ff34339e21b in spicy::rt::Sink::_addAndCheck (this=0x7ff3287b6690, data=std::optional<hilti::rt::Bytes> = {...}, rseq=51, rupper=53,
  c={data = std::optional<hilti::rt::Bytes> = {[contained value] = {<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >> = "\214\261s\223_A6\352\020O\303I)]Я\353K\331\346\336\354\261\303\vM\253\v7k\232\276\204\300\272\017\314\352CH\364ٓq\003ز\203\000\202\352k\315+J9Fr\224", <incomplete sequence \357>, _control = std::shared_ptr<const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > *> (empty) = {get() = 0x0}}}, rseq = 51, rupper = 111}) at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/spicy/runtime/src/sink.cc:84
zeek/zeek#3  0x00007ff34339f6d6 in spicy::rt::Sink::_newData (this=0x7ff3287b6690, data=std::optional<hilti::rt::Bytes> = {...}, rseq=42, len=11)
    at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/spicy/runtime/src/sink.cc:213
zeek/zeek#4  0x00007ff3433a1fcf in spicy::rt::Sink::write (this=0x7ff3287b6690, data=..., seq=std::optional<unsigned long> = {...}, len=std::optional<unsigned long> [no contained value])
    at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/spicy/runtime/src/sink.cc:463
zeek/zeek#5  0x00005564a7179b06 in __hlt_quic::QUIC::__hook_Frame___on_c_b67f (__dd=..., __self=...) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:4987
zeek/zeek#6  0x00005564a711d12c in __hlt_quic::QUIC::__hook_Frame___on_c (__dd=..., __self=...) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic___linker__.cc:97
zeek/zeek#7  0x00005564a719cd7a in __hlt_quic::QUIC::Frame::__on_c (this=0x7ff30a7c4cd0, __dd=...) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:2111
zeek/zeek#8  0x00005564a7159c78 in __hlt_quic::QUIC::Frame::__parse_switch_6_case_3_stage1 (this=0x7ff30a7c4cd0, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:2239
zeek/zeek#9  0x00005564a715b2b8 in __hlt_quic::QUIC::Frame::__parse_switch_6_stage1 (this=0x7ff30a7c4cd0, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:2377
zeek/zeek#10 0x00005564a71589b4 in __hlt_quic::QUIC::Frame::__parse_QUIC__Frame_stage2 (this=0x7ff30a7c4cd0, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:2139
zeek/zeek#11 0x00005564a7158ec4 in __hlt_quic::QUIC::Frame::__parse_stage1 (this=0x7ff30a7c4cd0, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:2165
zeek/zeek#12 0x00005564a716c694 in __hlt_quic::QUIC::Packet::__parse_frames_stage1 (this=0x7ff310e7c610, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value], __dst=...) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:3838
zeek/zeek#13 0x00005564a716b8b6 in __hlt_quic::QUIC::Packet::__parse_QUIC__Packet_stage2 (this=0x7ff310e7c610, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:3774
zeek/zeek#14 0x00005564a716d1c4 in __hlt_quic::QUIC::Packet::__parse_stage1 (this=0x7ff310e7c610, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:3907
zeek/zeek#15 0x00005564a716f30f in __hlt_quic::QUIC::RequestFrame::__parse_QUIC__RequestFrame_stage2 (this=0x7ff31287bd50, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:4091
zeek/zeek#16 0x00005564a716f81e in __hlt_quic::QUIC::RequestFrame::__parse_stage1 (this=0x7ff31287bd50, __data=..., __begin=..., __cur=..., __trim=..., __lah=..., __lahe=...,
    __error=std::optional<hilti::rt::RecoverableFailure> [no contained value]) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:4117
zeek/zeek#17 0x00005564a716feee in __hlt_quic::QUIC::RequestFrame::parse1 (__data=..., __cur=std::optional<hilti::rt::stream::View> [no contained value], __context=std::optional<spicy::rt::UnitContext> = {...})
    at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:4156
zeek/zeek#18 0x00005564a7183579 in operator() (__closure=0x7ff33a32be40, r=0x7ff33ba75a40) at /usr/local/src/zeek-7.0.0-rc1/build/src/analyzer/protocol/quic/quic_QUIC.cc:5509
zeek/zeek#19 0x00005564a718a940 in operator() (__closure=0x0, f=..., h=0x7ff33ba75a40) at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/hilti/runtime/include/hilti/rt/fiber.h:132
zeek/zeek#20 0x00005564a718a9ca in _FUN () at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/hilti/runtime/include/hilti/rt/fiber.h:131
zeek/zeek#21 0x00007ff3428122b4 in hilti::rt::detail::Callback::operator() (this=0x7ff33ba75a50, h=0x7ff33ba75a40) at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/hilti/runtime/include/hilti/rt/fiber.h:141
zeek/zeek#22 0x00007ff34280e13f in __fiber_run_trampoline (argsp=0x7ff33a32bfb0) at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/hilti/runtime/src/fiber.cc:95
zeek/zeek#23 0x00007ff34280de98 in fiber_bottom_abort (fiber=0x7ff342895e75 <fiber_asm_invoke>, args=0x7ff33a32bfb0) at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/hilti/runtime/src/fiber.cc:75
zeek/zeek#24 0x00007ff342896a47 in fiber_stack_alignment () at /usr/local/src/zeek-7.0.0-rc1/auxil/spicy/3rdparty/fiber/src/fiber.c:374
zeek/zeek#25 0x00007ff33a32bff0 in ?? ()
zeek/zeek#26 0x0000000000000000 in ?? ()

I had the following script running at the time to capture traffic surrounding the crash:

#!/bin/bash

# Define the network interface and output directory
INTERFACE="eno1"
OUTPUT_DIR="/home/jbarnart/zt"  # Change this to your desired directory
COREDUMP_DIR="/var/lib/apport/coredump"
PCAP_FILE="${OUTPUT_DIR}/traffic.pcap"

# Ensure the output directory exists
mkdir -p "$OUTPUT_DIR"

# Function to capture all traffic in rotating file
capture_all_traffic() {
  sudo tcpdump -i "$INTERFACE" -w "$PCAP_FILE" -G 120
}

# Monitor coredump directory for changes
monitor_coredump() {
  inotifywait -m -e create "$COREDUMP_DIR" |
  while read -r directory events filename; do
    # Get the current timestamp
    crash_time=$(date +"%Y-%m-%d %H:%M:%S")
    echo "Detected change in coredump directory: $filename at $crash_time"

    # Output the timestamp to a file
    echo "Crash detected at: $crash_time" > "${OUTPUT_DIR}/crash_timestamp.txt"

    # Wait for 30 seconds to capture post-crash traffic
    sleep 30

    # Stop the tcpdump process
    sudo pkill tcpdump

    # Exit the script
    exit 0
  done
}

# Run the capture and monitor in parallel
capture_all_traffic &
monitor_coredump &
wait

So, I should have the traffic that was being processed at the time of the crash. However, running through zeek -r does not cause a crash, just:

1721339566.531225 error: string with embedded NUL: "SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.10 vs \x00\x00\x00\x0c"
1721339576.255936 error: string with embedded NUL: "SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.10 vs \x00\x00\x04<"
1721339578.956443 error: string with embedded NUL: "SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.10 vs \x00\x00\x00\x0c"

And, thinking it might have something to do with cluster mode or our configuration, I tried running it through tcpreplay, but that didn't cause a crash either.

Thanks @JordanBarnartt , moving this to zeek/spicy.

The simple reproducer below is from the docs, but modified such that overlapping data is passed to a sink - that triggers an immediate segfault. This could happen with QUIC if there's duplicated UDP packets (maybe?), or possibly a crafted QUIC packets with overlapping crypto frames.

So, I should have the traffic that was being processed at the time of the crash.

So, this would still be very interesting to have! Does your tcpdump report dropped packets? Can you double check eno1 is the right interface? To limit to just QUIC traffic, the following might also work tcpdump -i INTERFACE 'udp and port 443'.

This is where sink.write() is in the QUIC analyzer:
https://github.com/zeek/zeek/blob/e982a18792c3327c04a05efe64b92b573b1a17f9/src/analyzer/protocol/quic/QUIC.spicy#L282-L284

$ cat overlap.spicy 
module Test;

public type Foo = unit {

    sink data;

    on %init {
        self.data.connect(new Bar);
        self.data.write(b"567", 5);
        self.data.write(b"89", 6);
    }
};

public type Bar = unit {
    s: bytes &eod;
    on %done { print self.s; }
};

$ echo -n "" | spicy-driver ./overlap.spicy  -p Test::Foo
Segmentation fault (core dumped)

$ coredumpctl debug
Core was generated by `spicy-driver ./overlap.spicy -p Test::Foo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007f0492182d76 in spicy::rt::Sink::_reportOverlap (this=this@entry=0x5629e5ee04a0, rseq=6, old=..., new_=...) at /home/awelzel/corelight-oss/zeek/auxil/spicy/spicy/runtime/src/sink.cc:302
#2  0x00007f0492182099 in spicy::rt::Sink::_addAndCheck (this=this@entry=0x5629e5ee04a0, data=std::optional = {...}, rseq=rseq@entry=6, rupper=rupper@entry=8, c=...) at /home/awelzel/corelight-oss/zeek/auxil/spicy/spicy/runtime/src/sink.cc:84
#3  0x00007f04921846f5 in spicy::rt::Sink::_newData (this=0x5629e5ee04a0, data=std::optional = {...}, rseq=6, len=<optimized out>) at /home/awelzel/corelight-oss/zeek/auxil/spicy/spicy/runtime/src/sink.cc:213
#4  0x00007f049218762d in spicy::rt::Sink::write (this=<optimized out>, data=..., seq=..., len=...) at /home/awelzel/corelight-oss/zeek/auxil/spicy/spicy/runtime/src/sink.cc:463
#5  0x00007f04921e4b53 in __hlt::Test::__hook_Foo___on_0x25_init_a846 (__self=...) at /tmp/Test_1027ac8db248b14e-250ce9354021d034.cc:419
#6  0x00007f04921e2891 in __hlt::Test::Foo::__on_0x25_init (this=0x5629e5f1a380) at /tmp/Test_1027ac8db248b14e-250ce9354021d034.cc:280

Before #1736 which landed with v1.11.0-dev-147-g206e411c6 we would fail here with an exception,

[error] terminating with uncaught exception of type std::__1::bad_function_call: std::exception

In that patch we we switched to a different infra with less checks so this inconsistent state is not caught anymore. Building and running the reproducer with 206e411 and ASAN enabled reports the following:

$ spicy-driver -d /tmp/bla.spicy -f /dev/zero -p Test::Foo
ninja: no work to do.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==25931==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x00010df19810 sp 0x00010df19560 T0)
==25931==Hint: pc points to the zero page.
==25931==The signal is caused by a READ memory access.
==25931==Hint: address points to the zero page.
    #0 0x0  (<unknown module>)
    #1 0x1028c5ec0 in spicy::rt::Sink::_addAndCheck(std::__1::optional<hilti::rt::Bytes>, unsigned long long, unsigned long long, std::__1::__list_iterator<spicy::rt::Sink::Chunk, void*>) sink.cc:84
    #2 0x1028cf568 in spicy::rt::Sink::_newData(std::__1::optional<hilti::rt::Bytes>, unsigned long long, unsigned long long) sink.cc:213
    #3 0x1028da000 in spicy::rt::Sink::write(hilti::rt::Bytes, std::__1::optional<unsigned long long>, std::__1::optional<unsigned long long>) sink.cc:463
    #4 0x10db9a1a4 in __hlt::Test::__hook_Foo___on_0x25_init_7d97(hilti::rt::ValueReference<__hlt::Test::Foo>&)+0x5ac (__library__f630738a0984c2a4.hlto:arm64+0x161a4)
    #5 0x10db86794 in __hlt::Test::__hook_Foo___on_0x25_init(hilti::rt::ValueReference<__hlt::Test::Foo>&)+0x8 (__library__f630738a0984c2a4.hlto:arm64+0x2794)
    #6 0x10db92d3c in __hlt::Test::Foo::__parse_stage1(hilti::rt::ValueReference<hilti::rt::Stream>&, hilti::rt::stream::SafeConstIterator const&, hilti::rt::stream::View, hilti::rt::Bool, SafeInt<long long, hilti::rt::integer::detail::SafeIntException>, hilti::rt::stream::SafeConstIterator, std::__1::optional<hilti::rt::RecoverableFailure>)+0x590 (__library__f630738a0984c2a4.hlto:arm64+0xed3c)
    #7 0x10db9847c in __hlt::Test::Foo::parse3(hilti::rt::ValueReference<spicy::rt::ParsedUnit>&, hilti::rt::ValueReference<hilti::rt::Stream>&, std::__1::optional<hilti::rt::stream::View> const&, std::__1::optional<spicy::rt::UnitContext> const&)+0xb4c (__library__f630738a0984c2a4.hlto:arm64+0x1447c)
    #8 0x10dbca7d8 in hilti::rt::detail::Callback::Callback<hlt::Test::Foo::parse3(hilti::rt::ValueReference<spicy::rt::ParsedUnit>&, hilti::rt::ValueReference<hilti::rt::Stream>&, std::__1::optional<hilti::rt::stream::View> const&, std::__1::optional<spicy::rt::UnitContext> const&)::$_5>(hlt::Test::Foo::parse3(hilti::rt::ValueReference<spicy::rt::ParsedUnit>&, hilti::rt::ValueReference<hilti::rt::Stream>&, std::__1::optional<hilti::rt::stream::View> const&, std::__1::optional<spicy::rt::UnitContext> const&)::$_5)::'lambda'(linb::any const&, hilti::rt::detail::Fiber*)::__invoke(linb::any const&, hilti::rt::detail::Fiber*)+0x1bc (__library__f630738a0984c2a4.hlto:arm64+0x467d8)
    #9 0x107660434 in hilti::rt::detail::Callback::operator()(hilti::rt::detail::Fiber*) const fiber.h:142
    #10 0x10765f058 in __fiber_run_trampoline fiber.cc:95
    #11 0x107805000 in fiber_asm_invoke fiber_asm_aarch64_apcs.S:83

==25931==Register values:
 x[0] = 0x000000010bae0b30   x[1] = 0x0000000000000006   x[2] = 0x000000010df19d40   x[3] = 0x000000010df19d90
 x[4] = 0x000000010df19460   x[5] = 0x000000010df19828   x[6] = 0x02000001028c8c50   x[7] = 0x0000000000000001
 x[8] = 0x0000000000000000   x[9] = 0x0000000000000006  x[10] = 0x0000000000000018  x[11] = 0x000000010df19580
x[12] = 0x0000007021c032c6  x[13] = 0x000000010df19568  x[14] = 0x000000010df19560  x[15] = 0x0000007021c033ce
x[16] = 0x00000001028d4e84  x[17] = 0x0000000100be45d0  x[18] = 0x0000000000000000  x[19] = 0x000000010df19660
x[20] = 0x0000007000020000  x[21] = 0x0000007021c03614  x[22] = 0x000000010df1b1b0  x[23] = 0x000000010af280d8
x[24] = 0x000000010df1b0c0  x[25] = 0x000000010df1b0d7  x[26] = 0x0000000021be361a  x[27] = 0x0000000021be3618
x[28] = 0x000000010df1b0a0     fp = 0x000000010df19810     lr = 0x00000001028c985c     sp = 0x000000010df19560
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
==25931==ABORTING
[3]    25931 abort      spicy-driver -d /tmp/bla.spicy -f /dev/zero -p Test::Foo

we would fail here with an exception,

Ah, but that was unhandled and took down the process, too? @JordanBarnartt hit that one before, too: zeek/zeek#3780

we would fail here with an exception,

Ah, but that was unhandled and took down the process, too? @JordanBarnartt hit that one before, too: zeek/zeek#3780

Yes, it looks like Zeek's Spicy driver does not catch std::exception so we would have taken down the Zeek process before as well. I'd still consider going from terminate on exception1 to a nullptr deref a regression since the former is controlled (though undesirable) behavior, while the latter is invoking UB.

Footnotes

  1. for uncaught exceptions std::terminate is called which in turn will call std::abort

I'd still consider going from terminate on exception1 to a nullptr deref a regression since the former is controlled (though undesirable) behavior.

Agree, though the backtrace for the null-deref actually showed where the issue was, while the prior rethrow didn't quite help. Maybe some gdb-exception-wizzarding could've helped to figure out where the original exception had been thrown.