thu-pacman/self-checkpoint

Checkpoint restore on a fresh node

Hi guys,

Thanks a lot for sharing this great idea along with the code.

I am pretty new to this area and have one question. Suppose one node fails (for example, it loses power) and a fresh node replaces it. The other nodes all quickly resume their state from the ckpt stored in SHM, but how does the fresh node get its ckpt? Is there an implicit assumption, for example that the ckpts on different nodes are the same, so the fresh node can fetch it from any of the other nodes?

Apologies if this is a really naive question. I could not find a clear answer, but maybe I missed something.