Error when training process in a Colab TPU environment
Opened this issue · 3 comments
Hello,

I am trying to replicate the training process in a Colab TPU environment.
At step 1.2, "Or train the mention proposal model yourself", I am getting the following error:
```
ValueError                                Traceback (most recent call last)
/content/drive/MyDrive/corefQA/code/run/run_mention_proposal.py in <module>()
    190 tf.set_random_seed(FLAGS.seed)
    191 # start train/evaluate the model.
--> 192 tf.app.run()
    193
    194

35 frames
/tensorflow-1.15.2/python3.7/tensorflow_core/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape)
   1558           "construct, such as a loop or conditional. When creating a "
   1559           "variable inside a loop or conditional, use a lambda as the "
-> 1560           "initializer." % name)
   1561   # pylint: enable=protected-access
   1562   dtype = initial_value.dtype.base_dtype

ValueError: Initializer for variable Variable/ is from inside a control-flow construct, such as a loop or conditional. When creating a variable inside a loop or conditional, use a lambda as the initializer
```
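For context, this error is raised when a `tf.Variable` is created inside a control-flow construct (`tf.cond`, `tf.while_loop`), so its initializer tensor lives inside the branch and cannot be lifted out. A minimal sketch of the workaround the error message itself suggests (this is an illustration, not the corefQA code; it assumes a TF 2.x-compatible `tensorflow` install):

```python
import tensorflow as tf

# Fails in graph mode when created inside a loop/conditional, because the
# initializer tensor is born inside the control-flow construct:
#   v = tf.Variable(tf.zeros([2, 3]))
#
# Works: a lambda defers creation of the initializer tensor, so TF can
# lift the variable out of the control-flow scope.
v = tf.Variable(lambda: tf.zeros([2, 3]))
```

So one direction to investigate is which variable in the mention proposal model is being built inside a conditional, and whether its initializer can be wrapped in a lambda like this.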
I am running the training step with the following Colab cell:
```python
if train_mention_proposal:
    DATA_DIR = GS_SEMVEVAL_TRFILES
    OUTPUT_DIR = f"{GS_PATH}/models/mention_proposal"
    PRETRAINED_MODEL = GS_SQUAD2_ES_TRAINED_MODEL
    INIT_CHECKPOINT = f"{PRETRAINED_MODEL}/model.ckpt"
    %cd {REPO_PATH}
    %run run/run_mention_proposal.py \
      --output_dir=$OUTPUT_DIR \
      --bert_config_file=$BERT_CONFIG \
      --init_checkpoint=$INIT_CHECKPOINT \
      --vocab_file=$BERT_VOCAB \
      --logfile_path=./train_mention_proposal.log \
      --num_epochs=8 \
      --keep_checkpoint_max=50 \
      --save_checkpoints_steps=500 \
      --train_file=$DATA_DIR/train.overlap.corefqa.es.tfrecord \
      --dev_file=$DATA_DIR/dev.overlap.corefqa.es.tfrecord \
      --test_file=$DATA_DIR/test.overlap.corefqa.es.tfrecord \
      --do_train=True \
      --do_eval=False \
      --do_predict=False \
      --learning_rate=1e-5 \
      --dropout_rate=0.2 \
      --mention_threshold=0.5 \
      --hidden_size=1024 \
      --num_docs=5604 \
      --window_size=384 \
      --num_window=6 \
      --max_num_mention=60 \
      --start_end_share=False \
      --loss_start_ratio=0.3 \
      --loss_end_ratio=0.3 \
      --loss_span_ratio=0.3 \
      --use_tpu=True \
      --tpu_name=$TPU_NAME \
      --seed=2333
```
Do you have any ideas as to what could be the problem?
Thank you in advance.
Have you been able to fix this? I'm experiencing the same problem.
No, I am stuck with this :(
I got past it (not sure it was a real fix, but using `tf.cast` instead of `tf.Variable` seemed to help). However, the XLA device then kept complaining about something like `Input 1 to node Tile_1 with op Tile must be a compile-time constant`. So XLA does not support "cast", "tile", and other such ops with runtime-dependent arguments, and this code abounds in them. Maybe the problem is Colab itself, and these scripts should be run with a remote IP connection to the TPU, as in the original implementation.
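To make the XLA constraint concrete: the `multiples` argument of `tf.tile` must be known when XLA compiles the graph, so a runtime tensor there fails while Python-int constants compile fine. A hypothetical sketch (not the repo's code) using CPU XLA via `jit_compile`, which behaves like the TPU in this respect:

```python
import tensorflow as tf

# XLA requires the `multiples` of tf.tile to be a compile-time constant.
# A runtime tensor there triggers "must be a compile-time constant".

@tf.function(jit_compile=True)          # CPU XLA here; TPU enforces the same rule
def tile_ok(x):
    return tf.tile(x, [2, 1])           # Python-int multiples: compiles fine

# This variant fails under XLA, because `n` is only known at run time:
# @tf.function(jit_compile=True)
# def tile_bad(x, n):
#     return tf.tile(x, tf.stack([n, 1]))

y = tile_ok(tf.ones([1, 3]))            # shape (2, 3)
```

So any `tf.tile` in the repo whose multiples come from a tensor (e.g. a dynamic batch or window count) would need to be rewritten against static shapes before the model can compile on TPU.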