princeton-vl/RAFT-Stereo

A lot of nan value in n_predictions


I used the rectified YAV images to test the model and got this error:
Traceback (most recent call last):
  File "/home/rc/StereoMatching/RAFT-Stereo/train_stereo.py", line 256, in <module>
    train(args)
  File "/home/rc/StereoMatching/RAFT-Stereo/train_stereo.py", line 167, in train
    loss, metrics = sequence_loss(flow_predictions, flow, valid)
  File "/home/rc/StereoMatching/RAFT-Stereo/train_stereo.py", line 50, in sequence_loss
    assert not torch.isnan(flow_preds[i]).any() and not torch.isinf(flow_preds[i]).any()
AssertionError
I debugged the program and found many NaN values in n_predictions. Could you please give me some advice?

This can happen when training with mixed precision. Two solutions I've found to work:

  1. Use full precision. This will use ~2x as much memory, though.

  2. Suppress large gradients midway through the backward pass (the function below zeroes, rather than caps, oversized gradients). You can do this by wrapping convolutions with this function:

import torch

GRAD_CLIP = .01

class GradClip(torch.autograd.Function):
    """Identity in the forward pass; sanitizes gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_x):
        o = torch.zeros_like(grad_x)
        # Zero out gradients whose magnitude exceeds the threshold...
        grad_x = torch.where(grad_x.abs() > GRAD_CLIP, o, grad_x)
        # ...and zero out any NaN gradients.
        grad_x = torch.where(torch.isnan(grad_x), o, grad_x)
        return grad_x
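As a usage sketch, the autograd function above can be wrapped in an nn.Module and placed directly after a convolution, so that gradients entering the convolution during backprop are sanitized first. The `GradientClip` module name and the surrounding layers here are illustrative assumptions, not taken from the RAFT-Stereo code; `GradClip` is repeated so the example runs on its own:

```python
import torch
import torch.nn as nn

GRAD_CLIP = .01

# Repeated from the snippet above so this example is self-contained.
class GradClip(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_x):
        o = torch.zeros_like(grad_x)
        grad_x = torch.where(grad_x.abs() > GRAD_CLIP, o, grad_x)
        grad_x = torch.where(torch.isnan(grad_x), o, grad_x)
        return grad_x

class GradientClip(nn.Module):
    """Identity in the forward pass; zeroes oversized or NaN gradients
    as they flow backward through this point in the network."""
    def forward(self, x):
        return GradClip.apply(x)

# Illustrative placement: gradients reaching the convolution's weights
# pass through GradientClip's backward first and get sanitized.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    GradientClip(),
    nn.ReLU(),
)
```

Because the module is an identity in the forward pass, it changes nothing about the network's outputs; it only filters the gradient stream during backprop.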