Sync does not seem to block - Exceptions/computations stuck when running in distributed mode
utaal opened this issue · 9 comments
Hi again,
I've encountered another couple of issues when running https://gist.github.com/utaal/96c4328fbc48a07d69b4 on mono.
First, in distributed mode with two processes, process 1
does not seem to actually block on Sync
. The following is the output of process 1
.
Logging initialized to console
Pushing data
Syncing 0
00:00:02.7292913, Syncing at epoch (0) that is in the future of __Input0
Synced 0
Syncing 1
00:00:02.7400550, Syncing at epoch (1) that is in the future of __Input0
Synced 1
Syncing 2
00:00:02.7401329, Syncing at epoch (2) that is in the future of __Input0
Synced 2
Syncing 3
00:00:02.7401982, Syncing at epoch (3) that is in the future of __Input0
Synced 3
Syncing 4
00:00:02.7407855, Syncing at epoch (4) that is in the future of __Input0
Synced 4
Syncing 5
00:00:02.7408682, Syncing at epoch (5) that is in the future of __Input0
Synced 5
Syncing 6
00:00:02.7409424, Syncing at epoch (6) that is in the future of __Input0
Synced 6
Syncing 7
00:00:02.7410222, Syncing at epoch (7) that is in the future of __Input0
Synced 7
Syncing 8
00:00:02.7410966, Syncing at epoch (8) that is in the future of __Input0
Synced 8
Syncing 9
00:00:02.7411738, Syncing at epoch (9) that is in the future of __Input0
Synced 9
(...)
I was watching the output of process 0
and by this time the subscription did not even finish outputting data for epoch 1
.
Is this behavior somehow expected? Do you believe it may be mono
-specific?
Moreover, occasionally, when running with -t
> 1 we either get the following exception mid-computation - all the processes stop working immediately after this - or the computation just gets stuck.
00:00:03.2077718, Graph 0 failed on scheduler 0 with exception:
System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2[Microsoft.Research.Naiad.Runtime.Progress.Pointstamp,System.Int64].get_Item (Pointstamp key) [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.ConsiderFlushingBufferedUpdates () [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.OnRecv (System.Collections.Generic.Dictionary`2 deltas) [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateProducer.Start () [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Scheduling.Scheduler.RunWorkItem (Int32 graphId) [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Scheduling.Scheduler.InternalStart () [0x00000] in <filename unknown>:0
00:00:03.2084448, Cancelling execution of graph 0, due to exception:
System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2[Microsoft.Research.Naiad.Runtime.Progress.Pointstamp,System.Int64].get_Item (Pointstamp key) [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.ConsiderFlushingBufferedUpdates () [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.OnRecv (System.Collections.Generic.Dictionary`2 deltas) [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateProducer.Start () [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Scheduling.Scheduler.RunWorkItem (Int32 graphId) [0x00000] in <filename unknown>:0
at Microsoft.Research.Naiad.Scheduling.Scheduler.InternalStart () [0x00000] in <filename unknown>:0
I apologize for the missing file names / line numbers, I was not able to reproduce the issue when running with --debug
.
Thank you.
Andrea
I've commented out a few Tracing annotations/calls in release_0.5
(to make it compile under mono) and ran the aforementioned computation - I'm experiencing the same issues.
We're running the latest mono
, 3.10.0
:
Mono JIT compiler version 3.10.0 (tarball Wed Oct 22 10:06:03 BST 2014)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
TLS: normal
SIGSEGV: altstack
Notification: kqueue
Architecture: amd64
Disabled: none
Misc: softdebug
LLVM: supported, not enabled.
GC: sgen
Interesting. I'll take a peek. The multi-threaded crash is not something I've ever seen before, and suggests that some of the locking is behaving differently under mono than on Windows; just about everything in ProgressUpdateAggregator.ConsiderFlushingBufferedUpdates
is under a lock of some sort (though this.BufferedUpdates
gets captured elsewhere, and the locking design is pretty horrible here).
The first issue (not syncing) should be able to be tracked down pretty simply. Historically, it has been because not all processes advance their inputs, which doesn't seem to be the case in your gist, so, neat! This would explain why the subscription doesn't fire (if some part of Naiad thinks there is more data due from the input, it won't fire the subscription). I'll explore!
I may be a bit slow, in that my development environment has recently changed radically, and I'm not yet fluent with mono. :)
Hi Andrea,
Here are some quick comments, which may vary in the level of satisfaction they produce. :)
I recommend switching the computation.sync(i);
to the following (yes, I know I wrote that, sorry!):
var subscription = computation.NewInput(in1)
.TestStep1()
.TestStep2()
.Subscribe( /* stuff goes in here */);
and then further along, call subscription.sync(i);
. What seems to be happening is that the warning being printed is spurious. When you call computation.sync(i);
Naiad checks to see if you might be making a mistake by calling sync
without having supplied enough inputs. It does this in a bad and lazy way (asking each input if they've received the corresponding epoch of inputs) and of course they haven't because it's a race. Calling sync
directly on the subscription assumes you know what you are doing and just waits appropriately.
Next, It seems like there is some difference between windows and mono with respect to event signaling. At least, worker threads appear to remain stuck waiting to be woken to process notifications for your two vertices (which do nothing, but still cause Naiad to coordinate and wait). One fix seems to be to use the --broadcastwakeup
flag, which switches thread wake-up to one based on an event count object. When I do this, I don't get the threads stalling. We'll still look into it and see if we can find the bad behavior causing the difference, but if this helps you in the meantime, excellent.
Finally, no clue about the dictionary exception issue at the moment. I have not reproduced it yet; if you have a set-up which seems like it causes it to happen somewhat regularly, I'd be delighted to get the details. I suspect in the future we'll be re-tooling some of this code, and making it less break-prone. At the moment, it seems like the only access to this.BufferedUpdates
by key is guarded by a lock, and a if (this.BufferedUpdates.ContainsKey(frontier[i]))
, so it is a bit mysterious. There is another part of the code which clears the dictionary, but only after it has been detached from this.BufferedUpdates
while under the same lock. Oof.
I think I've tracked down a candidate for the exception-throwing in ProgressUpdateAggregator
. I can't really test it because I can't cause it to fail, but:
It looks like this.Notifications
, the dictionary tracking outstanding local notifications, is read under lock(this.Lock)
but mutated under a different lock. If this is actually the dictionary that is throwing, we should be able to fix that by extending the use of lock(this.Lock)
from line 124 down to line 141 in UpdateAggregator.cs
.
We'll try and push a new version soon, but if you make the suggested edit (extend the lock to cover the updates to this.Notifications
) and see if the crashes vanish, that would help too!
Thanks for the bug!
Thank you so much for the writeup and the fix. I apologize for not getting back to you earlier, other university projects took up all my time.
The crash seems to have in fact vanished (we've upgraded to 0.5), however - it seems like that Sync(epoch)
is still not behaving as I would expect. It seems to wait for the entire epoch to be processed if I use the overload for Subscribe
that accepts a Action<int, IEnumerable<R>>
(so every worker receives part of the output). It doesn't seem to matter whether I use Computation.Sync
or Subscription.Sync
. Is this expected?
Hi Andrea,
The intended behavior for Subscription.Sync(epoch)
for this overload is that it blocks waiting for the entirety of records associated with epoch
to be received before invoking the supplied action. This should mean waiting on the inputs which can reach the subscription in the dataflow graph, and any records they may produce. Computation.Sync
should block on every subscription reaching epoch epoch
.
I can't tell if this is in line with what you are seeing, or if the problem is that it is waiting longer than you would have hoped / expected. Could you say a bit more about which one it is? :)
There is a more expressive overload (with actions for each of OnRecv
, OnNotify
, and OnCompleted
, where the first is called as data are received). This is the one "closest to the metal" of what the underlying dataflow vertex sees, and might give a clearer sense of when Naiad is triggering notifications, and whether this corresponds to when you would expect them or not.
Frank
Edit: I should say, the intended behavior of Subscription.Sync(epoch)
is independent of the overload used. The overloads provide different implementations of the Subscribe vertex, but Sync(epoch)
should just block until all local instances (vertices) of the stage have responded that they have been notified for epoch
.
Thanks for the quick reply and sorry for the lack of clarity in the previous message.
I would have expected for .Sync
to behave independently of the overload of .Subscribe
used, however it seems like that if I use the basic .Subscribe(..., Action<IEnumerable<R>>)
, .Sync(epoch)
does not wait for all workers to have completed all the records associated with epoch
. In other words, only forcing the subscription to deliver output on every worker seems to guarantee that .Sync
waits for an epoch
to be completed across the cluster.
Independently of whether I call .Sync
on the subscription or computation, if I use the basic overload for Subscribe
.Sync
does not block and consequently I get a steady stream of warning messages Syncing at epoch (x) in the future of y.
.
Hopefully this makes more sense now.
Ah, I see. Yes, the intended behavior of Sync(epoch)
is that it blocks until all process-local instances of the vertex have fired their notifications. That means that if there are no process-local instances of the vertex, it will not block. The overload of Subscribe
that uses only an Action<IEnumerable<R>>
is just such a vertex, since it moves all records to one instance of the vertex at process 0.
So, in that sense it is "expected behavior", but may not be what you want. We were a bit conflicted on whether things like Sync
should do barrier-style synchronization, across all processes, or whether the dataflow-style synchronization (block until locally complete) was more expressive. It should be the case that you can get out a barrier implementation by using the Subscribe
whose action takes an additional int argument (or two) indicating the location, and whose implementation installs vertices on each process. Note: you don't actually have to deliver output to each worker, you just have to threaten to do it; putting a PartitionBy
just in front of the subscription will make sure the data go to the one instance you want.
This is definitely an area where we are learning a bit about our intended use, and others' expected behavior, and where they don't line up. Ideally we'd like to get to something clean and predictable, without introducing synchronization in cases where the user doesn't want or need it.
Thank you for the clarification. We've figured out the barrier-style synchronization trick and are using it at the start of our computation as we need to push a lot of (almost-)static context to various nodes before we begin accepting real-time data (our current approach is to push a couple of epochs at the beginning with such data and .Sync()
before connecting to the real-time data source). This might in fact be abusing the computation model - however it seemed like a reasonable compromise with having to move the state around out of band.