Viredery/tf-eager-fasterrcnn

"Gradients do not exist for variables" appears during training

Closed this issue · 9 comments

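Hello, I ran into a problem during training: the loss is NaN. I have debugged it but cannot find where the problem is. The logic looks fine, and the dataset inputs should also be correct.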

log

W1107 09:59:16.315858 140495756453632 optimizer_v2.py:1029] Gradients do not exist for variables ['rcnn_bbox_fc/kernel:0', 'rcnn_bbox_fc/bias:0'] when minimizing the loss.
epoch 0 0 nan
W1107 09:59:39.785169 140495756453632 optimizer_v2.py:1029] Gradients do not exist for variables ['rcnn_bbox_fc/kernel:0', 'rcnn_bbox_fc/bias:0'] when minimizing the loss.
epoch 0 1 nan
W1107 10:00:02.858589 140495756453632 optimizer_v2.py:1029] Gradients do not exist for variables ['rcnn_bbox_fc/kernel:0', 'rcnn_bbox_fc/bias:0'] when minimizing the loss.
epoch 0 2 nan
W1107 10:00:25.615397 140495756453632 optimizer_v2.py:1029] Gradients do not exist for variables ['rcnn_bbox_fc/kernel:0', 'rcnn_bbox_fc/bias:0'] when minimizing the loss.
epoch 0 3 nan
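One way to narrow down a NaN like the one in the log above is to check each loss term separately instead of only the sum. The sketch below is a hypothetical helper, not part of this repository; it assumes the four loss values can be converted with `float()` (for eager tensors, for example via `.numpy()`), and the numbers in the example are made up for illustration.

```python
import math


def non_finite_terms(losses):
    """Return the names of loss terms whose value is NaN or infinite.

    `losses` maps a term name to a scalar that float() can convert.
    This helper is hypothetical and only illustrates the check.
    """
    return [name for name, value in losses.items()
            if not math.isfinite(float(value))]


# Made-up values standing in for one training step.
example_losses = {
    "rpn_class_loss": 0.69,
    "rpn_bbox_loss": 0.12,
    "rcnn_class_loss": 1.30,
    "rcnn_bbox_loss": float("nan"),
}

bad_terms = non_finite_terms(example_losses)
print(bad_terms)  # ['rcnn_bbox_loss']
```

In a training loop this could be called right after the forward pass, stopping as soon as the returned list is non-empty, so that the first bad batch and the offending term are known.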

IngLP commented

+1. Same issue here

I cannot reproduce this problem. Which dataset are you using?

Could you try changing these two lines

grads = tape.gradient(loss_value, model.variables)
optimizer.apply_gradients(zip(grads, model.variables),
                          global_step=tf.train.get_or_create_global_step())

to the following?

grads = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables),
                          global_step=tf.train.get_or_create_global_step())
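`tape.gradient` returns `None` for any variable that the loss does not depend on, which is what the warning in the log reports. A small diagnostic like the sketch below can list those variables before the update is applied. The helper name and the `FakeVar` stand-in are hypothetical and exist only so the example runs without TensorFlow; it does not fix the NaN loss by itself.

```python
from collections import namedtuple


def split_missing_grads(grads, variables):
    """Pair gradients with variables and report variables without a gradient.

    Returns (kept, missing): `kept` is a list of (grad, var) pairs whose
    gradient is not None, `missing` is a list of names of the other variables.
    """
    kept, missing = [], []
    for grad, var in zip(grads, variables):
        if grad is None:
            missing.append(getattr(var, "name", repr(var)))
        else:
            kept.append((grad, var))
    return kept, missing


# Stand-in for a TensorFlow variable (assumption: only .name is needed here).
FakeVar = namedtuple("FakeVar", ["name"])

variables = [FakeVar("rcnn_class_fc/kernel:0"), FakeVar("rcnn_bbox_fc/kernel:0")]
grads = [0.5, None]  # made-up gradients; None marks a disconnected variable

kept, missing = split_missing_grads(grads, variables)
print(missing)  # ['rcnn_bbox_fc/kernel:0']

# In the training loop this would be used roughly as:
#   grads = tape.gradient(loss_value, model.trainable_variables)
#   kept, missing = split_missing_grads(grads, model.trainable_variables)
#   optimizer.apply_gradients(kept, ...)
```

If the same variables show up as missing on every step, the loss term that should reach them (here presumably the R-CNN box regression loss) is likely not connected to them at all, which is worth checking before looking at the optimizer.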

I ran the following code and it trains successfully:

import os
import tensorflow as tf
import numpy as np
import visualize
tf.enable_eager_execution()
tf.executing_eagerly()

os.environ['CUDA_VISIBLE_DEVICES'] = '1'


from detection.datasets import coco, data_generator

img_mean = (123.675, 116.28, 103.53)

img_std = (1., 1., 1.)

train_dataset = coco.CocoDataSet('./COCO2017/', 'val',
                                 flip_ratio=0.5,
                                 pad_mode='fixed',
                                 mean=img_mean,
                                 std=img_std,
                                 scale=(640, 896))

train_generator = data_generator.DataGenerator(train_dataset)

from detection.models.detectors import faster_rcnn

model = faster_rcnn.FasterRCNN(
    num_classes=len(train_dataset.get_categories()))


img, img_meta, bboxes, labels = train_dataset[0]
batch_imgs = tf.Variable(np.expand_dims(img, 0))
batch_metas = tf.Variable(np.expand_dims(img_meta, 0))

_ = model((batch_imgs, batch_metas), training=False)

model.load_weights('weights/faster_rcnn.h5', by_name=True)

batch_size = 1

train_tf_dataset = tf.data.Dataset.from_generator(
    train_generator, (tf.float32, tf.float32, tf.float32, tf.int32))
train_tf_dataset = train_tf_dataset.padded_batch(
    batch_size, padded_shapes=([None, None, None], [None], [None, None], [None]))


optimizer = tf.train.MomentumOptimizer(1e-3, 0.9, use_nesterov=True)

epochs = 12

for epoch in range(epochs):
    iterator = train_tf_dataset.make_one_shot_iterator()

    loss_history = []
    for (batch, inputs) in enumerate(iterator):

        batch_imgs, batch_metas, batch_bboxes, batch_labels = inputs
        with tf.GradientTape() as tape:
            rpn_class_loss, rpn_bbox_loss, rcnn_class_loss, rcnn_bbox_loss = \
                model((batch_imgs, batch_metas, batch_bboxes, batch_labels), training=True)

            loss_value = rpn_class_loss + rpn_bbox_loss + rcnn_class_loss + rcnn_bbox_loss

        grads = tape.gradient(loss_value, model.variables)
        optimizer.apply_gradients(zip(grads, model.variables),
                                  global_step=tf.train.get_or_create_global_step())

        loss_history.append(loss_value.numpy())

    print('epoch', epoch, '-', np.mean(loss_history))

The log:

epoch 0 - 1.4815336
epoch 1 - 1.1633286
epoch 2 - 1.0060173
epoch 3 - 0.8848684
epoch 4 - 0.78657615
epoch 5 - 0.69864273
epoch 6 - 0.62510866
epoch 7 - 0.5631116
epoch 8 - 0.51153713
epoch 9 - 0.4704786
epoch 10 - 0.4405557
epoch 11 - 0.40453458

I have not installed or upgraded to TensorFlow 2.0 here, so if you are using the 2.0 version adapted in TensorFlow-2.x-Tutorials, I am not sure what is going wrong.

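If the problem occurs under TensorFlow 2.0, feel free to close this issue. Once the epidemic is over and I am back at school, I will update the code to TF 2.0.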

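@Viredery Thanks for the reply~ I tried your code, and the same error still occurs with both COCO and my own dataset. I am using TensorFlow 2.1, so the TensorFlow version is very likely the cause :}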

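@CanshangD No problem. Moving to 2.0 or later requires some changes to the model, and some of the interfaces used in the code are no longer supported in 2.x. I later switched to writing MXNet and PyTorch, so this code was never updated to 2.0. Sorry for the trouble this has caused you.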

IngLP commented

Could you please comment in English, so that your contributions are useful for everyone?
I am trying to use this code too.
Thanks! 😊

@loripino21 This code is based on TensorFlow 1.11 and it works fine.

However, if you switch to TensorFlow 2.0 and modify the code as in https://github.com/dragen1860/TensorFlow-2.x-Tutorials/tree/master/16-fasterRCNN, you may run into the problem described in this issue.

I will upgrade the code from TensorFlow 1.11 to TensorFlow 2.0 when I have time~

@CanshangD @loripino21 I have upgraded the code to support TensorFlow 2.0.0 and am closing this issue.
If you run into any problems during training, please open a new issue.