Diagnostics PageRank Example Stuck
Hello. I was trying to follow the example on the README.md but tdiag gets stuck.
- On one terminal I execute:
cargo run --release -- --source-peers 2 graph --out graph.html
- On a second terminal I execute:
env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --release --example pagerank 1000 100000 -w 2
- pagerank runs to completion.
- tdiag acknowledges connection via:
Listening for 2 connections on 127.0.0.1:51317
Trace sources connected
Press enter to generate graph (this will crash the source computation if it hasn't terminated).
- I press enter but tdiag hangs indefinitely.
Looking at the stack trace of tdiag there are two threads. The main thread is waiting on a thread join. Thread2 also seems stuck on await_events. Stack trace for Thread2:
futex_wait_cancelable 0x00007ffff7d9c376
__pthread_cond_wait_common 0x00007ffff7d9c376
__pthread_cond_wait 0x00007ffff7d9c376
std::sys::unix::condvar::Condvar::wait condvar.rs:73
std::sys_common::condvar::Condvar::wait condvar.rs:50
std::sync::condvar::Condvar::wait condvar.rs:200
std::thread::park mod.rs:923
<timely_communication::allocator::thread::Thread as timely_communication::allocator::Allocate>::await_events thread.rs:44
<timely_communication::allocator::generic::Generic as timely_communication::allocator::Allocate>::await_events generic.rs:99
timely::worker::Worker<A>::step_or_park worker.rs:216
timely::execute::execute::{{closure}} execute.rs:206
timely_communication::initialize::initialize_from::{{closure}} initialize.rs:269
std::sys_common::backtrace::__rust_begin_short_backtrace backtrace.rs:130
std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} mod.rs:475
<std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once panic.rs:318
std::panicking::try::do_call panicking.rs:297
__rust_try 0x000055555661a74d
std::panicking::try panicking.rs:274
std::panic::catch_unwind panic.rs:394
std::thread::Builder::spawn_unchecked::{{closure}} mod.rs:474
core::ops::function::FnOnce::call_once{{vtable-shim}} function.rs:232
<alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
<alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
std::sys::unix::thread::Thread::new::thread_start thread.rs:87
start_thread 0x00007ffff7d95609
clone 0x00007ffff7ed1103
Please let me know if I missed something when executing the commands.
Hi, thanks for the report.
I can't reproduce this on my Mac (on the timely-dataflow and diagnostics master branch); are you on Linux?
Okay. I found the problem. I was using my own Cargo patch version of timely-dataflow. I had extended the TimelyEvent enum with my own variant. That seems to cause the locking behavior (even though I don't use the variant anywhere in the code...).
Right! Yes, the messages are encoded using https://github.com/TimelyDataflow/abomonation, which is essentially just writing out the in-memory representation of Rust objects. This means that even a minor difference in the definition of Events between the sender and the receiver can cause things to break in unexpected ways.
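To illustrate why an unused variant can still matter, here is a minimal sketch. It does not use abomonation or the real TimelyEvent; SenderEvent, ReceiverEvent, encode and decode are hypothetical names, and the encoding is reduced to a single discriminant byte. It also assumes the extra variant was not added at the end of the enum.

```rust
// Hypothetical stand-ins for two builds of the same event enum.
// These are NOT the real TimelyEvent definitions.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
enum SenderEvent {
    Operates,
    Custom, // variant that exists only in the patched build; never sent
    Schedule,
    Shutdown,
}

#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
enum ReceiverEvent {
    Operates,
    Schedule,
    Shutdown,
}

/// Writes the event as its raw discriminant byte, with no type information.
fn encode(event: SenderEvent) -> u8 {
    event as u8
}

/// Interprets a byte according to the receiver's definition of the enum.
fn decode(byte: u8) -> Option<ReceiverEvent> {
    match byte {
        0 => Some(ReceiverEvent::Operates),
        1 => Some(ReceiverEvent::Schedule),
        2 => Some(ReceiverEvent::Shutdown),
        _ => None,
    }
}

fn main() {
    // Only variants that both sides know about are sent, yet two of the
    // three are decoded incorrectly because the discriminants have shifted.
    for event in [SenderEvent::Operates, SenderEvent::Schedule, SenderEvent::Shutdown] {
        println!("{:?} -> {:?}", event, decode(encode(event)));
    }
}
```

In this sketch a Schedule event arrives as Shutdown and a Shutdown event is not recognised at all, even though Custom is never sent. A receiver that never sees the event it is waiting for could plausibly hang, as in the report above.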
If you need a custom event, you either need to change TimelyEvent in both the timely program and in connect (within diagnostics) by changing its dependency to point to your custom timely, or (maybe preferably) you can just register a separate event stream using something like https://github.com/TimelyDataflow/timely-dataflow/blob/4b4752ec7447253ac61c9804c9cf6406dabfc281/timely/examples/logging-send.rs#L21-L23
You can see how to make a receiver here: https://github.com/TimelyDataflow/diagnostics/blob/master/tdiag/src/commands/graph.rs#L51-L53
You can also customise how events are written out, by registering a different callback to the log_register: this way you can emit, for example, bincode or json-encoded events. We use the binary abomonation encoding because it's a lot faster than json and, afaik, faster than bincode (which is important for chatty logging streams like TimelyEvent).
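As a rough illustration of the trade-off, the sketch below tags each record with the variant's name instead of relying on memory layout. It is a hand-rolled text format, not json, bincode, or anything from timely's logging API, and the Event type and the encode and decode helpers are hypothetical.

```rust
// Hypothetical event type; NOT the real TimelyEvent.
#[derive(Debug, PartialEq)]
enum Event {
    Operates { id: usize },
    Schedule { id: usize },
}

/// Encodes an event as "name:id" text, so the variant is identified by name.
fn encode(event: &Event) -> String {
    match event {
        Event::Operates { id } => format!("operates:{}", id),
        Event::Schedule { id } => format!("schedule:{}", id),
    }
}

/// Decodes "name:id" text; unknown names or malformed ids yield an error.
fn decode(text: &str) -> Result<Event, String> {
    let (name, id) = text
        .split_once(':')
        .ok_or_else(|| format!("malformed record: {}", text))?;
    let id: usize = id
        .parse()
        .map_err(|_| format!("bad id in record: {}", text))?;
    match name {
        "operates" => Ok(Event::Operates { id }),
        "schedule" => Ok(Event::Schedule { id }),
        other => Err(format!("unknown event kind: {}", other)),
    }
}

fn main() {
    // A record both sides understand round-trips unchanged.
    println!("{:?}", decode(&encode(&Event::Schedule { id: 7 })));
    // A record from a sender with an extra variant is rejected explicitly.
    println!("{:?}", decode("custom:7"));
}
```

With a name-tagged format, a mismatch between sender and receiver shows up as an explicit error rather than a silently misread event. The cost is the extra formatting and parsing per event, which is the speed concern mentioned above.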