wangcongcong123/ttt

AttributeError: 'TPUStrategyV2' object has no attribute 'experimental_run_v2'


I'm trying to fine-tune in Colab on a TPU and keep getting this error. I'm not sure how to fix it or work around it.

Any advice?

I believe it comes from this line in ttt/t2t_trainer.py:

per_replica_losses = strategy.experimental_run_v2(train_step, args=(x_train, y_train,))

This is the full error output:

2022-03-13 04:56:03.212 INFO - run: args: {}
Output directory (tmp/t5-small_t2t_content-train) already exists and is not empty, you wanna remove it before start training? (y/n)y
2022-03-13 04:57:45.139 INFO inputs - get_with_prepare_func: reading cached data from /content/train/t5-small-data.pkl
2022-03-13 04:57:45.142 WARNING inputs - get_with_prepare_func: if you changed the max_seq_length/max_src_length/max_tgt_length, this may not correctly loaded, since the /content/train/t5-small-data.pkl is pickled based on first time loading
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
2022-03-13 04:57:45.222 INFO tpu_strategy_util - initialize_tpu_system: Deallocate tpu buffers before initializing tpu system.
WARNING:tensorflow:TPU system grpc://10.77.192.66 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
2022-03-13 04:57:45.917 WARNING tpu_strategy_util - initialize_tpu_system: TPU system grpc://10.77.192.66 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
INFO:tensorflow:Initializing the TPU system: grpc://10.77.192.66
2022-03-13 04:57:45.926 INFO tpu_strategy_util - initialize_tpu_system: Initializing the TPU system: grpc://10.77.192.66
INFO:tensorflow:Finished initializing TPU system.
2022-03-13 04:57:53.909 INFO tpu_strategy_util - initialize_tpu_system: Finished initializing TPU system.
2022-03-13 04:57:53.914 INFO - create_model: All TPU devices:
2022-03-13 04:57:53.916 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU')
2022-03-13 04:57:53.920 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU')
2022-03-13 04:57:53.922 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')
2022-03-13 04:57:53.925 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU')
2022-03-13 04:57:53.928 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU')
2022-03-13 04:57:53.930 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU')
2022-03-13 04:57:53.933 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU')
2022-03-13 04:57:53.935 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')
INFO:tensorflow:Found TPU system:
2022-03-13 04:57:53.938 INFO tpu_system_metadata - _query_tpu_system_metadata: Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
2022-03-13 04:57:53.941 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
2022-03-13 04:57:53.945 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
2022-03-13 04:57:53.948 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
2022-03-13 04:57:53.952 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
2022-03-13 04:57:53.958 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
2022-03-13 04:57:53.961 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
2022-03-13 04:57:53.965 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
2022-03-13 04:57:53.968 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
2022-03-13 04:57:53.972 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
2022-03-13 04:57:53.975 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
2022-03-13 04:57:53.979 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
2022-03-13 04:57:53.985 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
2022-03-13 04:57:53.988 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
2022-03-13 04:57:53.992 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
2022-03-13 04:57:53.995 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
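Model: "tft5_for_conditional_generation_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 shared (TFSharedEmbeddings) multiple                  16449536

 encoder (TFT5MainLayer)     multiple                  18881280

 decoder (TFT5MainLayer)     multiple                  25175808

=================================================================
Total params: 60,506,624
Trainable params: 60,506,624
Non-trainable params: 0
_________________________________________________________________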
2022-03-13 04:58:18.877 INFO - create_model: None
/content/ttt/ttt/t2t_trainer.py:56: FutureWarning: Passing inputs as a keyword argument is deprecated. Use train_dataset and eval_dataset instead.
FutureWarning,
2022-03-13 04:58:18.946 INFO t2t_trainer - train: set random seed for everything with 122
2022-03-13 04:58:19.412 INFO utils - write_args_enhance: {
"source_field_name": "source",
"target_field_name": "target",
"use_tpu": true,
"do_train": true,
"use_tb": true,
"model_select": "t5-small",
"data_path": "/content/train",
"task": "t2t",
"log_steps": 400,
"scheduler": "warmuplinear",
"do_eval": false,
"tpu_address": "10.77.192.66",
"output_folder": "t5-small_t2t_content-train",
"output_path": "tmp/t5-small_t2t_content-train",
"is_pretrain": false,
"is_load_from_data_cache": true,
"data_cache_path": "/content/train/t5-small-data.pkl",
"source_sequence_length": 111,
"target_sequence_length": 20,
"num_replicas_in_sync": 8,
"best": -Infinity,
"warmup_steps": 233
}
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/adam.py:105: UserWarning: The lr argument is deprecated, use learning_rate instead.
super(Adam, self).__init__(name, **kwargs)
epochs: 0%| | 0/6 [00:00<?, ?it/s]2022-03-13 04:58:19.433 INFO t2t_trainer - train: start training at epoch = 0
2022-03-13 04:58:19.440 INFO t2t_trainer - train: global train batch size = 64
2022-03-13 04:58:19.442 INFO t2t_trainer - train: using learning rate scheduler: warmuplinear
2022-03-13 04:58:19.446 INFO t2t_trainer - train: num_train_examples: 24867, total_steps: 2334, steps_per_epoch: 389
2022-03-13 04:58:19.454 INFO t2t_trainer - train: warmup_steps:233

0%| | 0/389 [00:00<?, ?it/s]
epochs: 0%| | 0/6 [00:00<?, ?it/s]

AttributeError Traceback (most recent call last)
in ()
----> 1 run()

3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in autograph_handler(*args, **kwargs)
1145 except Exception as e: # pylint:disable=broad-except
1146 if hasattr(e, "ag_error_metadata"):
-> 1147 raise e.ag_error_metadata.to_exception(e)
1148 else:
1149 raise

AttributeError: in user code:

File "/content/ttt/ttt/t2t_trainer.py", line 147, in distributed_train_step  *
    per_replica_losses = strategy.experimental_run_v2(train_step, args=(x_train, y_train,))

AttributeError: 'TPUStrategyV2' object has no attribute 'experimental_run_v2'
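
A possible workaround, for anyone hitting the same error: as far as I know, `experimental_run_v2` was renamed to `run` in TensorFlow 2.2, and the `TPUStrategyV2` object in the traceback evidently no longer has the old name. The sketch below picks whichever method the strategy object exposes. `run_on_replicas` and the two stand-in strategy classes are hypothetical names made up for illustration (they are not part of ttt or TensorFlow), and I have not tested this against ttt itself.

```python
def run_on_replicas(strategy, step_fn, args=(), kwargs=None):
    """Run step_fn on each replica using whichever method name exists.

    Prefers `strategy.run` (the newer name) and falls back to
    `strategy.experimental_run_v2` (the older name) if `run` is missing.
    """
    run = getattr(strategy, "run", None)
    if run is None:
        run = strategy.experimental_run_v2
    return run(step_fn, args=args, kwargs=kwargs)


# Stand-in classes so the sketch runs without a TPU or TensorFlow.
# They only mimic the method names; they are NOT real strategies.
class NewStyleStrategy:
    def run(self, fn, args=(), kwargs=None):
        return fn(*args, **(kwargs or {}))


class OldStyleStrategy:
    def experimental_run_v2(self, fn, args=(), kwargs=None):
        return fn(*args, **(kwargs or {}))


def train_step(x, y):
    # Placeholder for the real per-replica training step.
    return x + y


for strategy in (NewStyleStrategy(), OldStyleStrategy()):
    result = run_on_replicas(strategy, train_step, args=(1, 2))
    print(type(strategy).__name__, result)
```

If only recent TensorFlow versions matter, the simpler change is probably to edit the line in ttt/t2t_trainer.py directly so it calls `strategy.run(train_step, args=(x_train, y_train,))` instead of `strategy.experimental_run_v2(...)`.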