yandex-research/swarm

Backwards fault recovery

Opened this issue · 0 comments

Great work with this paper and congraturlations!

I had a quick question how disconnects are handled during a backwards pass. From the paper it seems that timeouts are only triggered on a forward pass. But during a backwards pass you need to return the gradients of a node's input to a node which has had that batch pass through it. I couldn't find any explanation on how this is handled in the paper and from what I see in the code, it seems just a random new expert is chosen, which doesn't seem to be a sound solution.

I was wondering if I am missing something.

Also, to replicate the results, what commands do you use to run the setup.py? Both install and build seem to be insufficient