EleutherAI/gpt-neo

TPU device does not support heartbeats.

iliemihai opened this issue · 0 comments

Hello,

When I try to train on a v3-32 TPU with the tpu-vm-tf-2.6.0-pod image version, I receive the following error:

Creating heartbeat manager for ['/job:worker/replica:0/task:0/device:CPU:0', '/job:worker/replica:0/task:2/device:CPU:0', '/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:3/device:CPU:0']
Configuring worker heartbeat: shutdown_mode: WAIT_FOR_COORDINATOR

TPU device does not support heartbeats. Failure handling will be disabled.
training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
  File "/home/dumitrescu_stefan/dev/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/home/dumitrescu_stefan/dev/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/dumitrescu_stefan/dev/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.PermissionDeniedError: From /job:worker/replica:0/task:2:
/home/dumitrescu_stefan; Permission denied
	 [[{{node create_file_writer/CreateSummaryFileWriter}}]]
Recent warning and error logs:
  OP_REQUIRES failed at summary_kernels.cc:50 : Permission denied: /home/dumitrescu_stefan; Permission denied
  OP_REQUIRES failed at summary_kernels.cc:50 : Permission denied: /home/dumitrescu_stefan; Permission denied
  OP_REQUIRES failed at summary_kernels.cc:50 : Permission denied: /home/dumitrescu_stefan; Permission denied

To Reproduce
Run the training script:

  1. python3 main.py --model gpt3_XL_256_Pile --steps_per_checkpoint 40000 --tpu TPU_NAME

Environment (please complete the following information):

  • TPUs: v3-32 with tpu-vm-tf-2.6.0-pod image
  • Configs:
    { "n_head": 32, "n_vocab": 64000, "embed_dropout": 0, "lr": 0.0002, "lr_decay": "cosine", "warmup_steps": 3000, "beta1": 0.9, "beta2": 0.95, "epsilon": 1e-8, "opt_name": "adam", "weight_decay": 0.1, "train_batch_size": 512, "attn_dropout": 0, "train_steps": 286150, "eval_steps": 10, "predict_steps": 1, "res_dropout": 0, "eval_batch_size": 512, "predict_batch_size": 1, "iterations": 500, "n_embd": 2048, "datasets": [["example", 25, "documents_random", 1.0]], "model_path": "/home/dumitrescu_stefan/gpt-neo/neo-models/GPT3_1.3B", "n_ctx": 2048, "n_layer": 24, "scale_by_depth": true, "scale_by_in": false, "attention_types" : [[["global"],24]], "mesh_shape": "x:16,y:2", "layout": "batch:x,memory_length:y,embd:y", "activation_function": "gelu", "recompute_grad": true, "gradient_clipping": 1.0, "tokens_per_mb_per_replica": 2048, "precision": "bfloat16" }
    Dataset config is:
    { "n_vocab": 64000, "path": "/home/dumitrescu_stefan/gpt-neo/data_tfrecords/train_shard_*.tfrecords", "eval_path": "", "tokenizer_path": "/home/dumitrescu_stefan/gpt-neo/tokenizer/tokenizer.json", "eos_id": 1, "padding_id": 0 }
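
In case it helps with triage, here is a minimal sketch that lists which entries in the two configs above are absolute local paths. The traceback reports the permission error from task:2 for /home/dumitrescu_stefan, so entries under that directory seem like the first thing to check, presumably because each pod worker resolves such a path on its own filesystem. The helper name `local_paths` is hypothetical and not part of gpt-neo; only the path-valued keys are copied from the configs, and the check itself (string values starting with "/") is an assumption about what counts as a local path.

```python
def local_paths(config):
    """Return the entries of `config` whose value is a string starting with '/'.

    Hypothetical helper, not part of gpt-neo. It only looks at top-level
    string values and treats a leading '/' as "absolute local path".
    """
    return {
        key: value
        for key, value in config.items()
        if isinstance(value, str) and value.startswith("/")
    }


# Path-valued entries copied from the model config in this issue.
model_cfg = {
    "model_path": "/home/dumitrescu_stefan/gpt-neo/neo-models/GPT3_1.3B",
}

# Path-valued entries copied from the dataset config in this issue.
dataset_cfg = {
    "path": "/home/dumitrescu_stefan/gpt-neo/data_tfrecords/train_shard_*.tfrecords",
    "eval_path": "",
    "tokenizer_path": "/home/dumitrescu_stefan/gpt-neo/tokenizer/tokenizer.json",
}

if __name__ == "__main__":
    for name, cfg in (("model", model_cfg), ("dataset", dataset_cfg)):
        for key, value in local_paths(cfg).items():
            print(f"{name} config: {key} -> {value}")
```

The empty `eval_path` is not reported, since it is not an absolute path.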