Loss suddenly becomes NaN

Question

DongHan9722 opened this issue 2 years ago · 6 comments

UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
return (self.sensitivity / (softmax(self.rhos.reshape(189 * self.h * self.w))
2023-06-13 16:26:44,229: Epoch 1 / 24, batch 1 / 49029, 2.8256 sec/batch
loss = [38.096947] prec@1 = [0.000000] prec@5 = [0.000000]
2023-06-13 16:26:44,317: Reducer buckets have been rebuilt in this iteration.
2023-06-13 16:26:44,363: Reducer buckets have been rebuilt in this iteration.
2023-06-13 16:27:21,904: Epoch 1 / 24, batch 100 / 49029, 0.4050 sec/batch
loss = [37.548126] prec@1 = [1.562500] prec@5 = [1.562500]
2023-06-13 16:27:59,892: Epoch 1 / 24, batch 200 / 49029, 0.3924 sec/batch
loss = [38.105316] prec@1 = [0.000000] prec@5 = [0.000000]
2023-06-13 16:28:37,892: Epoch 1 / 24, batch 300 / 49029, 0.3883 sec/batch
loss = [37.709789] prec@1 = [0.000000] prec@5 = [1.562500]
2023-06-13 16:29:15,694: Epoch 1 / 24, batch 400 / 49029, 0.3857 sec/batch
loss = [nan] prec@1 = [0.000000] prec@5 = [0.000000]
2023-06-13 16:29:53,225: Epoch 1 / 24, batch 500 / 49029, 0.3836 sec/batch
loss = [nan] prec@1 = [0.000000] prec@5 = [0.000000]
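
The UserWarning at the top of the log is separate from the NaN problem, but it goes away once `dim` is passed explicitly. A minimal sketch, assuming the tensor is flattened to 1-D as in the reshape call shown in the warning; the name `rhos` mirrors the log, and the shape `(189, 2, 2)` is a made-up stand-in for `189 * h * w`:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for self.rhos; only the leading 189 comes from the log.
rhos = torch.zeros(189, 2, 2)
flat = rhos.reshape(-1)          # 1-D tensor, like the reshape in the warning
probs = F.softmax(flat, dim=0)   # explicit dim: normalize over the only axis
print(probs.shape, float(probs.sum()))
```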

I checked the features just before the loss became NaN:

        inputs = images_to_batch(inputs) # [bs, 63x3, 112, 112]
        inputs = inputs.detach()
        inputs = noise_model(inputs)

        if torch.isnan(inputs).any() or torch.isinf(inputs).any():  # check the backbone input
            print('debug inputs')
            print(inputs)

        if self.amp:
            with amp.autocast():
                features = backbone(inputs)
            features = features.float()
        else:
            features = backbone(inputs)

        if torch.isnan(features).any() or torch.isinf(features).any():  # check the backbone output
            print('debug features')
            print(features)

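NaN appears in features while inputs is normal, so the problem must be inside the backbone. I tried printing the output of every layer in self.input_layer.modules().
The per-layer outputs are inspected in the forward of class Backbone(Module):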

def forward(self, q):  # q is the input
    modules = [module for module in self.input_layer.modules() if not isinstance(module, Sequential)]
    a = modules[0](q)  # a: output of the convolution layer
    b = modules[1](a)  # b: output of BatchNorm
    c = modules[2](b)  # c: output of PReLU

    if torch.isnan(b).any() or torch.isinf(b).any():
        print("Errorsq: ", torch.isnan(q).any() or torch.isinf(q).any())
        print(f"Errorsa: nan: {torch.isnan(a).any()}, inf {torch.isinf(a).any()}")
        print(f"ErrorsWE: nan: {torch.isnan(modules[0].weight).any()}, inf {torch.isinf(modules[0].weight).any()}")
        raise RuntimeError("non-finite values in the BatchNorm output")  # stop at the first bad batch

Errorsq: tensor(False, device='cuda:0') # the input is finite
Errorsa: nan: False, inf True # the convolution output contains inf
ErrorsWE: nan: False, inf False # the convolution weights are finite
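
A more general way to locate the first offending layer, without editing forward by hand, is to attach forward hooks to every leaf module. This is only a sketch: `find_first_nonfinite` and the toy network are hypothetical and not part of the DCTDP code, and the toy uses oversized float32 weights simply to force an overflow with finite inputs and finite weights.

```python
import torch
from torch import nn

def find_first_nonfinite(model, x):
    """Return the name of the first leaf module whose output has NaN/Inf, else None."""
    found, handles = [], []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                found.append(name)
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:  # always detach the hooks again
            handle.remove()
    return found[0] if found else None  # hooks fire in execution order

# Toy stand-in for the input layer: Conv -> BN -> PReLU.
toy = nn.Sequential(nn.Conv2d(3, 4, 3), nn.BatchNorm2d(4), nn.PReLU()).eval()
with torch.no_grad():
    toy[0].weight.fill_(1e38)  # finite, but 1e38 * 10 overflows float32
first_bad = find_first_nonfinite(toy, torch.full((1, 3, 8, 8), 10.0))
print(first_bad)
```

The returned name is the module's key in `named_modules()`, so it can be mapped back to the layer definition.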

In the end, I found Inf in the output of the first convolution of the backbone's input layer.
@wjxzju could you help me see where the problem might be?
@wizyoung
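
One possible explanation, which the thread does not confirm: if training ran with `self.amp` enabled, the convolution executes in float16 under autocast, and float16 cannot represent values above 65504. A sum of products can then overflow to inf even though every input and every weight is finite. The values below are made up purely to illustrate the effect with NumPy:

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # largest finite float16 value

# Hypothetical activations and weights: all finite and modest in size.
x = np.full(1000, 10.0, dtype=np.float32)
w = np.full(1000, 10.0, dtype=np.float32)

acc_fp32 = float(x @ w)              # the dot product fits easily in float32
with np.errstate(over="ignore"):     # silence the overflow warning on the cast
    acc_fp16 = np.float16(acc_fp32)  # the same value stored as float16

print(FP16_MAX, acc_fp32, acc_fp16, np.isinf(acc_fp16))
```

If this is the cause, smaller activations or weights (for example from a lower learning rate) or running the first layer in float32 would be expected to help.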

Answer 1 · 2023-06-13T15:14:48.000Z

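The dataset was downloaded from the following source:
Vggface2_FP
The data is stored in train.rec, so I used the code below to convert it into several tfrecord files; it also generates an index file.
@wjxzju is this the correct way to prepare the data?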

import os
import argparse
import numpy as np
import tensorflow as tf
import mxnet as mx
from datetime import datetime as dt

def parse_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description='data path information')
    parser.add_argument('--bin_path', default='faces_webface_112x112/train.rec', type=str,
                        help='path to the binary image file')
    parser.add_argument('--idx_path', default='faces_webface_112x112/train.idx', type=str,
                        help='path to the image index path')
    parser.add_argument('--tfrecords_name', default='TFR-CASIA_webface', type=str,
                        help='path to the output of tfrecords dir path')
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    data_shape = (3, 112, 112)
    print(tf.__version__)
    imgrec = mx.recordio.MXIndexedRecordIO(args.idx_path, args.bin_path, 'r')
    s = imgrec.read_idx(0)
    header, _ = mx.recordio.unpack(s)
    print(header.label)
    imgidx = list(range(1, int(header.label[0])))
    
    tfrecords_dir = os.path.join('./', args.tfrecords_name)
    tfrecords_name = args.tfrecords_name
    if not os.path.isdir(tfrecords_dir):
        os.makedirs(tfrecords_dir)
    
    idx_file = os.path.join(tfrecords_dir, '{}.index'.format(tfrecords_name))
    idx_writer = open(idx_file, 'w')
    
    count = 0
    cur_shard_size = 0
    cur_shard_idx = -1
    cur_shard_writer = None
    cur_shard_path = None
    cur_shard_offset = None
    for i in imgidx:
        img_info = imgrec.read_idx(i)
        header, img = mx.recordio.unpack(img_info)
        label_int = int(header.label)
        label = np.array(int(label_int), dtype=np.int32).tobytes()  # tostring() is deprecated in NumPy
        example = tf.train.Example(features=tf.train.Features(feature={
                'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label,])),
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img,]))}))
           
        if cur_shard_size == 0:
            print("{}: {} processed".format(dt.now(), count))
            cur_shard_idx += 1
            record_filename = '{0}-{1:05}.tfrecord'.format(tfrecords_name, cur_shard_idx)
            if cur_shard_writer is not None:
                cur_shard_writer.close()
            cur_shard_path = os.path.join(tfrecords_dir, record_filename)
            cur_shard_writer = tf.io.TFRecordWriter(cur_shard_path)
            cur_shard_offset = 0

        example_bytes = example.SerializeToString()
        cur_shard_writer.write(example_bytes)
        cur_shard_writer.flush()
        idx_writer.write('{}\t{}\t{}\t{}\n'.format(tfrecords_name, cur_shard_idx, cur_shard_offset, label_int))
        cur_shard_offset += (len(example_bytes) + 16)

        count += 1
        cur_shard_size = (cur_shard_size + 1) % 500000
    
    if cur_shard_writer is not None:
        cur_shard_writer.close()
    idx_writer.close()  # flush and close the index file as well
    print('total examples number = {}'.format(count))
    print('total shard number = {}'.format(cur_shard_idx+1))


if __name__ == '__main__':
    main()
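
The `+ 16` in the offset bookkeeping above matches the TFRecord framing: each record is an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The sketch below shows how an offset from the index file could be used to read one record back. `read_record_at` and `frame` are hypothetical helpers, and the CRC fields are skipped rather than verified.

```python
import io
import struct

def read_record_at(f, offset):
    """Read one TFRecord payload starting at byte `offset` (CRCs are not checked)."""
    f.seek(offset)
    header = f.read(12)                     # 8-byte length + 4-byte CRC of the length
    (length,) = struct.unpack("<Q", header[:8])
    payload = f.read(length)
    f.read(4)                               # 4-byte CRC of the payload
    return payload, offset + 16 + length    # same arithmetic as the index writer

def frame(payload):
    # Demo-only framing with zeroed CRC fields; a real TFRecord reader would reject it.
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4

buf = io.BytesIO(frame(b"first") + frame(b"second!"))
rec0, next_offset = read_record_at(buf, 0)
rec1, end_offset = read_record_at(buf, next_offset)
print(rec0, next_offset, rec1, end_offset)
```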

Answer 2 · 2023-06-20T03:06:52.000Z

The VGG dataset is fairly noisy; I suggest reducing the margin.

Answer 3 · 2023-06-20T08:19:15.000Z

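Thanks for the suggestion. Because of memory limits I can only set the batch size to 64; after lowering the learning rate, training runs normally.
One more question: for the DCTDP project, I could not find how the data is split into train and test sets, nor a testing script. Does this project use all of the VGG data for training?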

Answer 4 · 2023-06-29T07:32:31.000Z

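The test sets are the same as the usual ones, including LFW, CFP, IJBB and IJBC. You will need to write the test script yourself, adapting it from the training code.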

Answer 5 · 2023-06-29T08:31:06.000Z

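If I want to test on LFW, do I train on VGGFace2 and then use the trained model's backbone as a feature extractor to generate embeddings for the test set? Or do I need to fine-tune on LFW first? How is the accuracy in the paper computed? Is it the verification accuracy on the test set?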

Answer 6 · 2023-06-30T14:12:10.000Z

My training parameters are as follows:
Task: DCTDP
Dataset: VGGFace2
Model: IR-34
Batch size: 64

Loss after 24 epochs:
Epoch 24 / 24, batch 24500 / 24515, 0.3803 sec/batch
loss = [1.837506] prec@1 = [96.093750] prec@5 = [97.656250]

I tested the DCTDP backbone trained above with the verification.py provided under test (modified slightly for DCTDP) and got the following accuracies:

Method | LFW | CFP_FP | AgeDB | CALFW | CPLFW
pretrained 5 epoch | 0.719833 | 0.555286 | 0.502167 | 0.590000 | 0.579999
pretrained 24 epoch | 0.709999 | 0.577429 | 0.5125 | 0.586333 | 0.578

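As you can see, the model's accuracy differs little between 5 epochs and 24 epochs of training.
Moreover, after 24 epochs the accuracy is still quite low: only about 0.7 on the LFW test set.

@wjxzju could you give me some advice?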
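
For context on the numbers in the table: these benchmarks are normally scored as pair verification on roughly balanced lists of same and different pairs, so a score near 0.5 is close to chance. The sketch below is a simplified version of that metric, assuming L2-normalized embeddings compared by cosine similarity with a single best threshold. It is not the project's verification.py, which is usually run with a 10-fold protocol, and `verification_accuracy` plus the toy embeddings are hypothetical.

```python
import numpy as np

def verification_accuracy(emb_a, emb_b, same):
    """Best-threshold accuracy for pairs (emb_a[i], emb_b[i]) with boolean labels `same`."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = np.sum(a * b, axis=1)            # cosine similarity per pair
    best_acc, best_thr = 0.0, 0.0
    for thr in np.unique(sims):             # try every observed similarity as threshold
        acc = float(np.mean((sims >= thr) == same))
        if acc > best_acc:
            best_acc, best_thr = acc, float(thr)
    return best_acc, best_thr

# Toy pairs: the first two are the same identity, the last two are different.
emb_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
emb_b = np.array([[1.0, 0.1], [0.1, 1.0], [0.0, 1.0], [1.0, 0.0]])
same = np.array([True, True, False, False])
acc, thr = verification_accuracy(emb_a, emb_b, same)
print(acc, thr)
```

Running the same function on embeddings from a randomly initialized backbone is a quick way to see what a chance-level score looks like for a given pair list.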