Error related to object store. "The location of ObjectID(xxx) is unavailable yet. Waiting for further notification."
Hey guys, when I was trying to run allreduce with Hoplite on a Slurm cluster, I ran into the following problem. My log files first report `Get failed for 139.19.15.49` (this is the address of the head node), and later `The location of ObjectID(0000000000000000000081131282117120887345) is unavailable yet. Waiting for further notification.`
My process then gets stuck there. It seems to be related to getting an object from the store, maybe at this line: `grad_buffer = self.store.get(reduction_id)`?
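To check whether it is really that `get` call that never returns, I was planning to wrap it in something like the sketch below (plain Python threading; `get_with_timeout` and the 60-second timeout are just my own debugging helper, not a Hoplite API):

```python
import threading

def get_with_timeout(store, object_id, timeout=60):
    # Run the blocking get() in a helper thread so the actor can at least log that it hangs.
    result = {}

    def _get():
        result['buffer'] = store.get(object_id)

    t = threading.Thread(target=_get, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        print("store.get() still blocked after", timeout, "seconds", flush=True)
        t.join()  # keep waiting so the behaviour stays the same otherwise
    return result.get('buffer')
```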
For reference, this is the code of my main function:

```python
import time

import ray
import hoplite

hoplite.start_location_server()
ray.init(address='auto', ignore_reinit_error=True)

# args, args_dict, num_workers and iterations are set up earlier (not shown).
workers = [DataWorker.remote(args_dict, i, model_type=args.model, device='cuda')
           for i in range(num_workers)]

print("Running synchronous parameter server training.")
step_start = time.time()
for i in range(iterations):
    gradients = []
    reduction_id = hoplite.random_object_id()
    all_grad_ids = [hoplite.random_object_id() for worker in workers]
    all_updates = []
    for grad_id, worker in zip(all_grad_ids, workers):
        all_updates.append(worker.compute_gradients.remote(grad_id, all_grad_ids, reduction_id))
    ray.get(all_updates)
    now = time.time()
    print("step time:", now - step_start, flush=True)
    step_start = now
```
and here is the code of the DataWorker actor:

```python
import time

import numpy as np
import ray
import torch

import bert      # my model wrapper module
import hoplite


@ray.remote(num_gpus=1)
class DataWorker(object):
    def __init__(self, args_dict, rank, model_type="custom", device="cpu"):
        self.store = hoplite.create_store_using_dict(args_dict)
        self.device = device
        self.model = bert.Vit_Wrapper().to(device)
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.02)
        self.rank = rank
        self.is_master = hoplite.get_my_address().encode() == args_dict['redis_address']

    def compute_gradients(self, gradient_id, gradient_ids, reduction_id, batch_size=128):
        start_time = time.time()
        if self.is_master:
            # print("i'm master and i start reduce")
            reduced_gradient_id = self.store.reduce_async(
                gradient_ids, hoplite.ReduceOp.SUM, reduction_id)
        data = torch.randn(batch_size, 3, 224, 224, device=self.device)
        self.model.zero_grad()
        output = self.model(data)
        loss = torch.mean(output)
        loss.backward()
        # Flatten all local gradients into one contiguous uint8 buffer and put it in the store.
        gradients = self.model.get_gradients()
        cont_g = np.concatenate([g.ravel().view(np.uint8) for g in gradients])
        print(cont_g.shape)
        buffer = hoplite.Buffer.from_buffer(cont_g)
        gradient_id = self.store.put(buffer, gradient_id)
        print(gradient_id)
        # Wait for the reduced gradients; this is where the process seems to hang.
        grad_buffer = self.store.get(reduction_id)
        print(grad_buffer)
        summed_gradients = self.model.buffer_to_tensors(grad_buffer)
        self.optimizer.zero_grad()
        self.model.set_gradients(summed_gradients)
        self.optimizer.step()
        print(self.rank, "in actor time", time.time() - start_time)
        return None
```
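For context, the wrapper helpers used above (`get_gradients`, `set_gradients`, `buffer_to_tensors`) do roughly the following. This is a simplified sketch rather than my exact code; it assumes the wrapper is an `nn.Module`, that the gradients are float32, and that the Hoplite buffer supports the Python buffer protocol:

```python
import numpy as np
import torch
import torch.nn as nn

class Vit_Wrapper(nn.Module):
    # ... model definition omitted ...

    def get_gradients(self):
        # Gradients as a list of numpy arrays, one per parameter.
        return [p.grad.detach().cpu().numpy() for p in self.parameters()]

    def set_gradients(self, gradients):
        # Copy the reduced gradients back into the parameters' .grad fields.
        for g, p in zip(gradients, self.parameters()):
            p.grad = g.to(p.device)

    def buffer_to_tensors(self, buffer):
        # Reinterpret the flat uint8 buffer as float32 and split it back
        # into per-parameter tensors with the original shapes.
        flat = np.frombuffer(buffer, dtype=np.float32)
        tensors, offset = [], 0
        for p in self.parameters():
            n = p.numel()
            tensors.append(torch.from_numpy(flat[offset:offset + n].copy()).reshape(p.shape))
            offset += n
        return tensors
```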
So, do you have any idea what causes the problem?
Apart from that, I have two other questions. First, as you can see, I have some `print` calls in my code for quick debugging, but none of that output shows up in my Slurm log files. Second, I'm not sure whether the problem could actually be related to permissions on the Slurm cluster: I have no root access, so maybe Hoplite cannot access the appointed store location? Besides `create_store_using_dict`, is there another way to create the store where I can point it at my home directory?
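Regarding the missing prints, one thing I could still try is forcing unbuffered output, in case it is only output buffering under Slurm (only one of my prints currently passes `flush=True`). Something generic like this at the top of the scripts, nothing Hoplite-specific:

```python
import functools
import sys

# Force line buffering on stdout/stderr (Python 3.7+), so prints show up in the
# Slurm output files immediately instead of only when the buffer happens to flush.
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)

# Or simply make every print flush by default.
print = functools.partial(print, flush=True)
```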