captain-pool/GSOC

TPU Estimator Crashing

Tensorflow version: tensorflow==2.0.0b0
Tensorflow Datasets Version: tfds-nightly==1.0.2.dev201906090105
Tensorflow Hub Version: tf-hub-nightly==0.5.0.dev201905270046

Issue

The code raises
End of sequence [[node input_pipeline_task0/while/IteratorGetNext (defined at image_retraining_tpu.py:139) ]]
for all values of max_steps passed to TPUEstimator.train(...).

Reproduce the issue

$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=8

The same error is raised for:

$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=4
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=100
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=500
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=1000

Line 139

classifier.train(
    input_fn=lambda params: input_fn(
        mode=tf.estimator.ModeKeys.TRAIN,
        **params),
    max_steps=FLAGS.max_steps)

Log file

The error starts at line 230 of output.log:
output.log

CC: @srjoglekar246 @vbardiovskyg

This looks like a bug in TPUEstimator. As far as I understand this part of the docs, the Estimator API handles the OutOfRangeError raised by the input function by stopping iteration (rather than raising an exception). TPUEstimator doesn't seem to behave that way yet.
Can you open an issue on TF to cross-check?
Also, does the script work with the try...except block?
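
To make the expected behaviour concrete, here is a rough plain-Python analogy (not TensorFlow code, and not taken from image_retraining_tpu.py). The names finite_batches and run_steps and the dataset size of 100 examples are hypothetical; StopIteration stands in for the end-of-sequence error. It sketches a training loop that stops quietly when a non-repeated input runs out, and shows that a repeated input never runs out:

```python
import itertools


def finite_batches(num_examples, batch_size):
    """Yield full batches once, then stop (like an input that is not repeated)."""
    for start in range(0, num_examples - batch_size + 1, batch_size):
        yield list(range(start, start + batch_size))


def run_steps(batches, max_steps):
    """Run at most max_steps steps, stopping quietly when the input ends."""
    steps_done = 0
    for _ in range(max_steps):
        try:
            next(batches)
        except StopIteration:
            # Analogue of end-of-sequence: stop instead of crashing.
            break
        steps_done += 1
    return steps_done


# Assumed sizes: 100 examples with batch size 32 give only 3 full batches,
# so a request for 8 steps ends early.
steps_once = run_steps(finite_batches(100, 32), max_steps=8)

# Restarting the input whenever it is exhausted means it never ends,
# so all 8 requested steps complete.
repeated = itertools.chain.from_iterable(
    finite_batches(100, 32) for _ in itertools.count())
steps_repeated = run_steps(repeated, max_steps=8)

print(steps_once, steps_repeated)
```

If the real input function yields fewer batches than the training loop requests, that might explain the error, but this is only a guess until the input pipeline is checked.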

Nope, it doesn't. Weirdly enough, the code doesn't even stop running. It keeps reporting that the TPU is healthy, tries to refresh the token, and never exits, even though there is no more code to execute.