TimelyDataflow/diagnostics

Diagnostics PageRank Example Stuck

Closed this issue · 3 comments

Hello. I was trying to follow the example on the README.md but tdiag gets stuck.

  1. In one terminal I execute: cargo run --release -- --source-peers 2 graph --out graph.html
  2. In a second terminal I execute: env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --release --example pagerank 1000 100000 -w 2
  3. pagerank runs to completion.
  4. tdiag acknowledges connection via:
Listening for 2 connections on 127.0.0.1:51317
Trace sources connected
Press enter to generate graph (this will crash the source computation if it hasn't terminated).
  5. I press enter but tdiag hangs indefinitely.

Looking at the stack trace of tdiag there are two threads. The main thread is waiting on a thread join. Thread2 also seems stuck on await_events. Stack trace for Thread2:

futex_wait_cancelable 0x00007ffff7d9c376
__pthread_cond_wait_common 0x00007ffff7d9c376
__pthread_cond_wait 0x00007ffff7d9c376
std::sys::unix::condvar::Condvar::wait condvar.rs:73
std::sys_common::condvar::Condvar::wait condvar.rs:50
std::sync::condvar::Condvar::wait condvar.rs:200
std::thread::park mod.rs:923
<timely_communication::allocator::thread::Thread as timely_communication::allocator::Allocate>::await_events thread.rs:44
<timely_communication::allocator::generic::Generic as timely_communication::allocator::Allocate>::await_events generic.rs:99
timely::worker::Worker<A>::step_or_park worker.rs:216
timely::execute::execute::{{closure}} execute.rs:206
timely_communication::initialize::initialize_from::{{closure}} initialize.rs:269
std::sys_common::backtrace::__rust_begin_short_backtrace backtrace.rs:130
std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} mod.rs:475
<std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once panic.rs:318
std::panicking::try::do_call panicking.rs:297
__rust_try 0x000055555661a74d
std::panicking::try panicking.rs:274
std::panic::catch_unwind panic.rs:394
std::thread::Builder::spawn_unchecked::{{closure}} mod.rs:474
core::ops::function::FnOnce::call_once{{vtable-shim}} function.rs:232
<alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
<alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
std::sys::unix::thread::Thread::new::thread_start thread.rs:87
start_thread 0x00007ffff7d95609
clone 0x00007ffff7ed1103

Please let me know if I missed something when executing the commands.

utaal commented

Hi, thanks for the report.
I can't reproduce this on my Mac (on the timely-dataflow and diagnostics master branch); are you on Linux?

Okay. I found the problem. I was using my own Cargo-patched version of timely-dataflow, in which I had extended the TimelyEvent enum with my own variant. That seems to cause the hang (even though I don't use the variant anywhere in the code...).

utaal commented

Right! Yes, the messages are encoded using https://github.com/TimelyDataflow/abomonation, which is essentially just writing out the in-memory representation of Rust objects. This means that even a minor difference in the definition of the event type between the writer and the reader can cause things to break in unexpected ways.
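
To make the failure mode concrete, here is a minimal sketch using only the standard library. ReaderEvent and WriterEvent are hypothetical stand-ins (not timely's TimelyEvent), and record_offset is an illustrative helper; abomonation's real framing is not modelled. The point is only that adding a variant can change the size of the type, so a reader compiled against the old definition would look for the second record at the wrong byte offset.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the event type compiled into the reader (e.g. tdiag).
#[allow(dead_code)]
enum ReaderEvent {
    Schedule(u64),
    Message(u32),
}

// Hypothetical stand-in for the same type after a local patch adds a variant.
// The new variant is never constructed, yet it still affects the type's layout.
#[allow(dead_code)]
enum WriterEvent {
    Schedule(u64),
    Message(u32),
    Custom([u64; 16]),
}

/// Byte offset at which the `index`-th record starts when records of
/// `record_size` bytes are laid out back to back.
fn record_offset(record_size: usize, index: usize) -> usize {
    record_size * index
}

fn main() {
    let writer = size_of::<WriterEvent>();
    let reader = size_of::<ReaderEvent>();
    println!("writer record: {} bytes, reader record: {} bytes", writer, reader);
    println!(
        "second record: written at offset {}, read from offset {}",
        record_offset(writer, 1),
        record_offset(reader, 1)
    );
}
```

Even when a new variant happens not to change the size, the compiler is free to pick a different layout, so the two sides should always be built against the same definition.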

If you need a custom event, you either need to change TimelyEvent in both the timely program and in connect (within diagnostics) by changing its dependency to point to your custom timely, or (maybe preferably) you can just register a separate event stream using something like https://github.com/TimelyDataflow/timely-dataflow/blob/4b4752ec7447253ac61c9804c9cf6406dabfc281/timely/examples/logging-send.rs#L21-L23

You can see how to make a receiver here: https://github.com/TimelyDataflow/diagnostics/blob/master/tdiag/src/commands/graph.rs#L51-L53

You can also customise how events are written out, by registering a different callback to the log_register: this way you can emit, for example, bincode or json-encoded events. We use the binary abomonation encoding because it's a lot faster than json and, afaik, faster than bincode (which is important for chatty logging streams like TimelyEvent).
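
As a rough illustration of why a self-describing encoding is more forgiving, the sketch below hand-rolls a tiny line-based format in which each event carries its variant name. It is only a stand-in for something like JSON or bincode: Event, encode, and decode are hypothetical names, and this is not the log_register callback API itself. A reader that sees an unknown variant name can skip it instead of misinterpreting raw memory.

```rust
// Hypothetical event type for illustration; not timely's TimelyEvent.
#[derive(Debug, PartialEq)]
enum Event {
    Schedule(u64),
    Message(u32),
}

// Encode an event as "<variant name> <payload>" rather than as raw memory.
fn encode(event: &Event) -> String {
    match event {
        Event::Schedule(id) => format!("schedule {}", id),
        Event::Message(len) => format!("message {}", len),
    }
}

// Decode one line. Unknown variant names and malformed payloads yield None,
// so the reader can skip records it does not understand.
fn decode(line: &str) -> Option<Event> {
    let mut parts = line.split_whitespace();
    let name = parts.next()?;
    let value = parts.next()?;
    match name {
        "schedule" => value.parse::<u64>().ok().map(Event::Schedule),
        "message" => value.parse::<u32>().ok().map(Event::Message),
        _ => None,
    }
}

fn main() {
    let line = encode(&Event::Schedule(7));
    println!("{} -> {:?}", line, decode(&line));
    // A record produced by a writer with an extra, unknown variant.
    println!("custom 1 -> {:?}", decode("custom 1"));
}
```

The cost is the one mentioned above: formatting and parsing every event is slower than copying bytes, which matters for chatty streams.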