ShannonAI/CorefQA

Error when running the training process in a Colab TPU environment

Opened this issue · 3 comments

Hello,
I am trying to replicate the training process in a Colab TPU environment.

In step 1.2, "Or train the mention proposal model yourself", I am getting the following error:

  ValueError                                Traceback (most recent call last)
  [/content/drive/MyDrive/corefQA/code/run/run_mention_proposal.py](https://localhost:8080/#) in <module>()
      190     tf.set_random_seed(FLAGS.seed)
      191     # start train/evaluate the model.
  --> 192     tf.app.run()
      193 
      194 
  
  35 frames
  [/tensorflow-1.15.2/python3.7/tensorflow_core/python/ops/resource_variable_ops.py](https://localhost:8080/#) in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape)
     1558               "construct, such as a loop or conditional. When creating a "
     1559               "variable inside a loop or conditional, use a lambda as the "
  -> 1560               "initializer." % name)
     1561         # pylint: enable=protected-access
     1562         dtype = initial_value.dtype.base_dtype
  
  ValueError: Initializer for variable Variable/ is from inside a control-flow construct, such as a loop or conditional. When creating a variable inside a loop or conditional, use a lambda as the initializer
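This error usually means a `tf.Variable` is being created inside a control-flow construct (the TPU training loop counts as one) with an eager initial value. As the message suggests, TensorFlow accepts a lambda as the initializer so variable creation can be deferred. A minimal sketch of the failing pattern and the suggested workaround (the variable name and shape here are illustrative, not from the CorefQA code):

```python
import tensorflow as tf

# Inside a loop/conditional (e.g. the TPU training loop), this pattern
# can raise the ValueError above, because the initializer tensor is
# created inside the control-flow construct:
#   v = tf.Variable(tf.zeros([3]))
#
# Passing a lambda defers creation of the initializer, which is what
# the error message recommends:
v = tf.Variable(lambda: tf.zeros([3]), name="deferred_init_var")
```

Whether this resolves the CorefQA script depends on where in `run_mention_proposal.py` the offending `tf.Variable` is created; the traceback only shows the generic variable name `Variable/`.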

I am using the following command to execute the training step:

if train_mention_proposal:
    DATA_DIR = GS_SEMVEVAL_TRFILES
    OUTPUT_DIR = f"{GS_PATH}/models/mention_proposal"
    PRETRAINED_MODEL = GS_SQUAD2_ES_TRAINED_MODEL
  
    INIT_CHECKPOINT=f"{PRETRAINED_MODEL}/model.ckpt" 
  
    %cd {REPO_PATH}

    %run run/run_mention_proposal.py \
      --output_dir=$OUTPUT_DIR \
      --bert_config_file=$BERT_CONFIG \
      --init_checkpoint=$INIT_CHECKPOINT \
      --vocab_file=$BERT_VOCAB \
      --logfile_path=./train_mention_proposal.log \
      --num_epochs=8 \
      --keep_checkpoint_max=50 \
      --save_checkpoints_steps=500 \
      --train_file=$DATA_DIR/train.overlap.corefqa.es.tfrecord \
      --dev_file=$DATA_DIR/dev.overlap.corefqa.es.tfrecord \
      --test_file=$DATA_DIR/test.overlap.corefqa.es.tfrecord \
      --do_train=True \
      --do_eval=False \
      --do_predict=False \
      --learning_rate=1e-5 \
      --dropout_rate=0.2 \
      --mention_threshold=0.5 \
      --hidden_size=1024 \
      --num_docs=5604 \
      --window_size=384 \
      --num_window=6 \
      --max_num_mention=60 \
      --start_end_share=False \
      --loss_start_ratio=0.3 \
      --loss_end_ratio=0.3 \
      --loss_span_ratio=0.3 \
      --use_tpu=True \
      --tpu_name=$TPU_NAME \
      --seed=2333

Do you have any ideas as to what could be the problem?

Thank you in advance

Have you been able to fix this? I'm experiencing the same problem.

No, I am stuck with this :(

I got past it (not sure it was a real fix, but using tf.cast instead of tf.Variable seemed to help). However, the XLA device then kept complaining with errors like Input 1 to node Tile_1 with op Tile must be a compile-time constant. So XLA apparently rejects the "cast", "tile", and similar ops that the code abounds in when their shape arguments are not compile-time constants. Maybe the problem is Colab itself, and these scripts should be run with a remote IP connection to the TPU, as in the original implementation.
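For what it's worth, the Tile complaint is about the `multiples` argument: under XLA it must be known at compile time. A minimal sketch of the distinction (not taken from the CorefQA code):

```python
import tensorflow as tf

x = tf.constant([[1, 2]])

# Static multiples: fine under XLA, since the output shape is known
# at compile time.
y = tf.tile(x, multiples=[3, 1])

# By contrast, deriving the multiples from a runtime tensor, e.g.
#   tf.tile(x, multiples=[tf.shape(some_tensor)[0], 1])
# gives XLA a non-constant input and triggers the
# "must be a compile-time constant" error.
```

So one way through is to make the tile/reshape arguments static (e.g. compute them from Python ints such as the fixed `window_size` and `num_window` flags) rather than from `tf.shape` of a runtime tensor.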