txsun1997/CoLAKE

Error in distributed training

vardaan123 opened this issue · 2 comments

Hi, I am following the instructions in the README to run your code (run_pretrain.sh), but I get the error OSError: [Errno 99] Cannot assign requested address. I changed the port on my server to a number in the valid range (0-65535). The server starts successfully with python large_emb.py --lr $LR --total_client 8 --emb_name $EMB_NAME --ent_emb ../wikidata5m_alias_emb/entities.npy, but the next line in run_pretrain.sh, which launches the training processes, fails with this error. Any help is appreciated!

Traceback (most recent call last):
  File "run_pretrain.py", line 262, in <module>
    train()
  File "run_pretrain.py", line 194, in train
    cache_dir=PYTORCH_PRETRAINED_BERT_CACHE + '/dist_{}'.format(args.local_rank))
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 655, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "../pretrain/model.py", line 19, in __init__
    self.ent_embeddings = LargeEmbedding(ip_config, emb_name, ent_lr, num_ent)
  File "../pretrain/large_emb.py", line 181, in __init__
    self.client = EmbClient(server_namebook)
  File "../pretrain/large_emb.py", line 97, in __init__
    super().__init__(server_namebook, queue_size, net_type)
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/dgl/contrib/dis_kvstore.py", line 588, in __init__
    self._machine_id = self._get_local_machine_id()
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/dgl/contrib/dis_kvstore.py", line 981, in _get_local_machine_id
    if ip in self._local_ip4_addr_list():
  File "/home/pahuja.9/miniconda3/lib/python3.7/site-packages/dgl/contrib/dis_kvstore.py", line 999, in _local_ip4_addr_list
    struct.pack('256s', name[:15].encode("UTF-8")))[20:24])
OSError: [Errno 99] Cannot assign requested address

Sorry for the late reply. Here is a similar issue:

#7 (comment)

Her solution is to rewrite _get_local_machine_id(self) in KVClient so that it reads the namebook and returns the current machine_id, which bypasses the _local_ip4_addr_list call that fails in your traceback. See if this works for you :)
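
For reference, here is a minimal sketch of what "read the namebook and return the current machine_id" could look like. It is not the code from the linked comment. The namebook layout (server_id -> [machine_id, ip, port, group_count]) is an assumption about the DGL version in use, and machine_id_from_namebook and guess_local_ips are hypothetical helper names, not part of DGL.

```python
import socket


def guess_local_ips():
    """Collect local IPv4 addresses without the per-interface ioctl lookup.

    Hypothetical helper: it only uses hostname resolution, so it may miss
    addresses; pass the known IP of this machine explicitly if it does.
    """
    ips = {"127.0.0.1"}
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        ips.update(socket.gethostbyname_ex(socket.gethostname())[2])
    except OSError:
        pass  # hostname does not resolve; keep loopback only
    return ips


def machine_id_from_namebook(server_namebook, local_ips):
    """Return the machine_id of the first namebook entry whose IP is local.

    Assumed layout: server_id -> [machine_id, ip, port, group_count].
    """
    for server_id in sorted(server_namebook):
        machine_id, ip = server_namebook[server_id][0], server_namebook[server_id][1]
        if ip in local_ips:
            return machine_id
    raise RuntimeError("no namebook entry matches a local IP: %s" % sorted(local_ips))


# Example namebook with made-up addresses: two servers on machine 0, one on machine 1.
example_namebook = {
    0: [0, "10.0.0.1", 30050, 2],
    1: [0, "10.0.0.1", 30051, 2],
    2: [1, "10.0.0.2", 30050, 2],
}
example_id = machine_id_from_namebook(example_namebook, {"10.0.0.2"})
```

In a KVClient subclass such as EmbClient, an overridden _get_local_machine_id could return the result of a lookup like this, assuming the namebook is reachable from the client at that point. For single-machine training, passing the IP from the ip_config file as the only entry in local_ips may be enough.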

Yes, that worked for me. Thanks!