Error in distributed training
vardaan123 opened this issue · 2 comments
Hi, I am following the instructions in the README to run your code (run_pretrain.sh). However, I get the error OSError: [Errno 99] Cannot assign requested address. I changed the port on my server to a number between 0-65536. I successfully started the server with python large_emb.py --lr $LR --total_client 8 --emb_name $EMB_NAME --ent_emb ../wikidata5m_alias_emb/entities.npy. But the following line in run_pretrain.sh, which launches the processes, gives me this error. Any help is appreciated!
Traceback (most recent call last):
  File "run_pretrain.py", line 262, in <module>
    train()
  File "run_pretrain.py", line 194, in train
    cache_dir=PYTORCH_PRETRAINED_BERT_CACHE + '/dist_{}'.format(args.local_rank))
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 655, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "../pretrain/model.py", line 19, in __init__
    self.ent_embeddings = LargeEmbedding(ip_config, emb_name, ent_lr, num_ent)
  File "../pretrain/large_emb.py", line 181, in __init__
    self.client = EmbClient(server_namebook)
  File "../pretrain/large_emb.py", line 97, in __init__
    super().__init__(server_namebook, queue_size, net_type)
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/dgl/contrib/dis_kvstore.py", line 588, in __init__
    self._machine_id = self._get_local_machine_id()
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/dgl/contrib/dis_kvstore.py", line 981, in _get_local_machine_id
    if ip in self._local_ip4_addr_list():
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/dgl/contrib/dis_kvstore.py", line 999, in _local_ip4_addr_list
    struct.pack('256s', name[:15].encode("UTF-8")))[20:24])
OSError: [Errno 99] Cannot assign requested address
Sorry for the late reply. Here is a similar issue:
Her solution is to rewrite def _get_local_machine_id(self) in KVClient so that it reads the namebook and returns the current machine_id. See if this works for you :)
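For anyone landing here later, a rough sketch of what such a rewrite might look like. This is not the exact code from the linked issue. It assumes the namebook maps server_id to [machine_id, ip, port, group_count] and that the client keeps it in an attribute called _server_namebook; both are assumptions, so check them against your installed dgl version. The helper names find_machine_id, local_ipv4_addresses, and PatchedClientMixin are made up for illustration.

```python
import socket


def local_ipv4_addresses():
    """Collect this host's IPv4 addresses by resolving the hostname,
    instead of querying every network interface as in the traceback."""
    addrs = {"127.0.0.1"}
    try:
        addrs.update(socket.gethostbyname_ex(socket.gethostname())[2])
    except OSError:
        pass  # hostname not resolvable; fall back to loopback only
    return addrs


def find_machine_id(server_namebook, local_addrs):
    """Return the machine_id of the first namebook entry whose IP is local.

    Assumed layout: {server_id: [machine_id, ip, port, group_count]}.
    """
    for server_id in sorted(server_namebook):
        entry = server_namebook[server_id]
        machine_id, ip = entry[0], entry[1]
        if ip in local_addrs:
            return machine_id
    raise RuntimeError("no namebook entry matches a local address")


class PatchedClientMixin:
    """Hypothetical mixin: list it before KVClient in the base classes
    so that this method overrides the original one."""

    def _get_local_machine_id(self):
        # _server_namebook is an assumed attribute name; verify it first.
        return find_machine_id(self._server_namebook, local_ipv4_addresses())
```

If every process runs on a single machine, an even simpler override that returns a fixed machine_id may be enough, but that shortcut is also an assumption about your setup.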
Yes, that worked for me. Thanks!