TimelyDataflow/timely-dataflow

Deadlock with other synchronization primitives.

frankmcsherry opened this issue · 6 comments

As reported in TimelyDataflow/differential-dataflow#223, timely quickly hangs when other synchronization primitives, at least a std::sync::Barrier, are used alongside it. The code below hangs rather promptly (within seconds, with two workers).

cc @ryzhyk

extern crate timely;

use timely::communication::initialize::Configuration;
use timely::dataflow::InputHandle;
use timely::dataflow::ProbeHandle;
use timely::dataflow::operators::{Input, Probe};

use std::sync::{Arc, Barrier};

fn main() {

    let nworkers: usize = std::env::args().nth(1).unwrap().parse().unwrap();
    let progress_barrier = Arc::new(Barrier::new(nworkers));

    timely::execute(Configuration::Process(nworkers), move |worker| {

        let progress_barrier = progress_barrier.clone();

        let mut input = InputHandle::<usize,()>::new();
        let mut probe = ProbeHandle::new();
        worker.dataflow(|scope| {
            scope.input_from(&mut input)
                 .probe_with(&mut probe);
        });

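        // Worker 0 alone drives the input forward; every worker meets at the
        // shared barrier each round and then steps until the probe catches up.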
        if worker.index() == 0 {
            for i in 0 .. {
                input.advance_to(i);
                eprintln!("i: {}", i);
                progress_barrier.wait();
                while probe.less_than(&i) {
                    worker.step_or_park(None);
                }
            }
        } else {
            input.close();
            for i in 0 .. {
                progress_barrier.wait();
                while probe.less_than(&i) {
                    worker.step_or_park(None);
                }
            }
        }
        eprintln!("done");
    }).unwrap();
}

Experimentation suggests that replacing the None timeouts with a one-second timeout results in one-second stalls, but the computation continues. This implies that what is lost is a wake-up, rather than an internal timely "you should execute this operator" message. When the worker re-awakens, it correctly performs whatever is next required of it.

One theory is that the barrier may consume thread wake-ups, and that timely should be more defensive: calls into step_or_park() should not assume that the thread token is still available whenever there are messages to read. No such assumption has been found yet, but it would explain things nicely.
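
A minimal, self-contained sketch of that defensive pattern (plain std threading, not timely's internals): the receiving thread drains its channel before every park, and parks with a timeout so that a wake-up consumed elsewhere costs at most some latency rather than a permanent hang.

use std::sync::mpsc::{channel, TryRecvError};
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = channel::<u64>();
    let receiver = thread::spawn(move || {
        loop {
            // Drain everything that is already visible before considering a park.
            loop {
                match rx.try_recv() {
                    Ok(msg) => println!("received {}", msg),
                    Err(TryRecvError::Empty) => break,
                    Err(TryRecvError::Disconnected) => return,
                }
            }
            // Park with a timeout as a safety net: a lost unpark now costs at
            // most one second of latency, mirroring the stall-but-continue
            // behavior observed in the experiment above.
            thread::park_timeout(Duration::from_secs(1));
        }
    });
    for i in 0..10u64 {
        tx.send(i).unwrap();
        receiver.thread().unpark();
    }
    drop(tx);
    receiver.join().unwrap();
}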

@ryzhyk suggests that there may be an issue with ArcPusher, with these two calls:

self.events
    .send((self.index, Event::Pushed(1)));
    // TODO : Perhaps this shouldn't be a fatal error (e.g. in shutdown).
    // .expect("Failed to send message count");
self.pusher.push(element)

The first prompts the recipient to read the channel, and once the event is consumed the recipient's interest in the channel may evaporate. If the recipient consumes the event before the second call deposits anything worth reading, it may decide "false alarm" and lose interest.

Surely they should be inverted!

Awkwardly, the second call wraps a "buzzer", which is what wakes up the recipient thread when a message is sent. So, if the order is inverted, the thread can be awakened, fail to find anything in the events queue, and go back to sleep.

It seems that we should first install the message, then alert the events queue that something is there to be found, and only then wake up the thread.

We don't do that yet.
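
A hypothetical sketch of that three-step ordering; the SketchPusher type and its fields are invented for illustration and are not timely's ArcPusher:

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread::Thread;

// Hypothetical stand-ins for the shared message queue, the events queue, and
// the handle used to wake the receiving worker thread.
struct SketchPusher<T> {
    queue: Arc<Mutex<VecDeque<T>>>,       // where the messages themselves live
    events: Arc<Mutex<VecDeque<usize>>>,  // "channel `index` has something" notices
    index: usize,
    recipient: Thread,
}

impl<T> SketchPusher<T> {
    fn push(&self, element: T) {
        // 1. Install the message, so anyone who looks will find it.
        self.queue.lock().unwrap().push_back(element);
        // 2. Announce on the events queue that this channel has something to read.
        self.events.lock().unwrap().push_back(self.index);
        // 3. Only now wake the recipient thread (the "buzz").
        self.recipient.unpark();
    }
}

With the buggy order, step 2 effectively happened before step 1, so a receiver could consume the event, find nothing to read, and go back to sleep before the message arrived.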

PR #298 just got an update to reflect the above, and with it the example above no longer seems to hang.

cc @ryzhyk

Sorry, Frank, I do not completely agree with your analysis (or at least it does not match my observations). I know that experiments confirm your theory, but I want to be sure that this is not just an (un)fortunate coincidence. I also have no problems switching to the timely barrier, but again I am worried that this way we will be concealing the bug, not actually fixing it.

While trying to understand this race, I instrumented timely to print detailed execution traces. After getting through the barrier, and before getting stuck, the worker makes it through one or two park/unpark phases; hence this cannot be caused by a lost unpark. What I think the barrier really does in this scenario is prevent the other workers from starting to process the next epoch and eventually waking up the stuck worker.

In other words, what I describe in TimelyDataflow/differential-dataflow#223 (comment) is what I actually observed in many runs, not just a hypothetical scenario.

PS. Haven't tried the PR yet; will look at it now.

The PR makes sense to me (based on my limited understanding of timely) and it does solve the race both in the repro example and in the original program that uncovered it.

Thanks for the quick fix!

Fixed by #298.