abisee/pointer-generator

Get NAN loss after 35k steps

StevenLOL opened this issue · 31 comments

Anyone got NaN?
(screenshot: selection_059, loss plot)

@StevenLOL I see this happen sometimes too -- seems to be a very common problem with Tensorflow training in general.

I've been having this problem and I decreased the learning rate as per various discussions on SO and that seemed to work. After a while I tried increasing it by 0.01 and started getting NaN's again. I've tried restoring the checkpoint and re-running with the lower learning rate, but I'm still seeing NaN. Does this mean my checkpoint is useless?

I am also getting NaN. Found out the culprit to be
line #227: log_dists = [tf.log(dist) for dist in final_dists]
in model.py

@Rahul-Iisc Is your workaround to filter out cases where dist == 0, for example:

log_dists = [tf.log(dist) for dist in final_dists if dist != 0] ?

@hate5six I'm still thinking about an appropriate solution. Each dist is a tensor of shape (batch_size, extended_vsize), so I am not sure dist != 0 will work. Also, I want log_dists to have the same length as final_dists.
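For illustration, a toy TF1 sketch (not the repo's code) of why the Python-level filter would not help here, and one elementwise alternative:

import tensorflow as tf

# `dist != 0` compares the whole tensor object against 0, so it would not remove
# individual zero probabilities (it would keep every dist unchanged). An elementwise
# alternative replaces the zeros before taking the log:
dist = tf.constant([[0.2, 0.0, 0.8],
                    [0.0, 1.0, 0.0]])  # toy (batch_size, extended_vsize)
safe_dist = tf.where(tf.equal(dist, 0.0), tf.fill(tf.shape(dist), 1e-10), dist)
log_dist = tf.log(safe_dist)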

Trying to convert the NaNs to 0 for now. I still need to look into why 0s are coming up in the distribution in the first place. @abisee @StevenLOL could you reopen the issue?

def _change_nan_to_number(self, tensor, number=1):
    # Replace non-finite entries (NaN/Inf) with number; with the default number=1, tf.log then gives 0
    return tf.where(tf.logical_not(tf.is_finite(tensor)), tf.ones_like(tensor) * number, tensor)

log_dists = [tf.log(self._change_nan_to_number(dist)) for dist in final_dists]

UPDATE: This didn't work. Loss ended up 0 instead of NaN.
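For what it's worth, a toy TF1 snippet (values made up) showing why the loss goes to 0 with this workaround: the non-finite entries become 1, and log(1) = 0.

import tensorflow as tf

dist = tf.constant([float('nan'), 0.5])
# Same idea as _change_nan_to_number above: non-finite entries are replaced with 1
patched = tf.where(tf.logical_not(tf.is_finite(dist)), tf.ones_like(dist), dist)
log_dist = tf.log(patched)

with tf.Session() as sess:
    print(sess.run(log_dist))  # [ 0.        -0.6931472] -- the NaN became log(1) = 0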

The change below worked for me. Add the following code to def _calc_final_dist(self, vocab_dists, attn_dists); details are in the comments.

      # OOV part of vocab is max_art_oov long. Not all the sequences in a batch will have max_art_oov tokens.
      # That will cause some entries to be 0 in the distribution, which will result in NaN when calculating log_dists.
      # Add a very small number to prevent that. (Requires import sys in model.py.)

      def add_epsilon(dist, epsilon=sys.float_info.epsilon):
        epsilon_mask = tf.ones_like(dist) * epsilon
        return dist + epsilon_mask

      final_dists = [add_epsilon(dist) for dist in final_dists]
      
      return final_dists

final_dists = [tf.clip_by_value(dist,1e-10,1.) for dist in final_dists]

@lizaigaoge550 did this work for you?
final_dists = [tf.clip_by_value(dist,1e-10,1.) for dist in final_dists]
Could you let me know which line you put this at?
Many thanks

@jamesposhtiger
Right after final_dists is computed.
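For anyone trying this, a self-contained TF1 toy example (not the repo's code) of what the clip does: the zero entry no longer produces -inf/NaN under the log.

import tensorflow as tf

# Toy (batch_size, extended_vsize) distribution for a single decoder step, with zero entries
final_dists = [tf.constant([[0.0, 0.3, 0.7],
                            [0.5, 0.5, 0.0]])]
final_dists = [tf.clip_by_value(dist, 1e-10, 1.) for dist in final_dists]
log_dists = [tf.log(dist) for dist in final_dists]

with tf.Session() as sess:
    print(sess.run(log_dists))  # all values finite; the zeros were clipped to 1e-10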

Can we restore the already-trained model after it starts getting NaN?

@apoorv001 probably not. This is where the concurrent eval job is useful: it saves the 3 best checkpoints (according to dev set) at any time. So in theory it should never save a NaN model. This is what we used to recover from NaN problems.

In any case, I know the NaN thing is very annoying. I haven't had time recently, but I intend to look at the bug, understand what's going wrong, and fix it. In the meantime, @Rahul-Iisc's solution appears to be working for several people.

Thanks @abisee for the clarification. However, I have had two different runs fail due to NaN after training for days; it would be a great favor to us if you could also upload the trained model along with the code.

@Rahul-Iisc I've had another look at the code. I see your point about

OOV part of vocab is max_art_oov long. Not all the sequences in a batch will have max_art_oov tokens. That will cause some entries to be 0 in the distribution, which will result in NaN when calculating log_dists. Add a very small number to prevent that.

However, in theory those zero entries in final_dists (i.e. NaN entries in log_dists) should never be used, because the losses = tf.gather_nd(-log_dist, indices) line, which is supposed to locate -log P(correct word) (equation 6 here) in log_dists, should only pick out source-text OOVs that actually appear in the training example.

So I think there must be something else wrong, either:

  1. Those NaN entries are getting picked out by the tf.gather_nd line (even though they shouldn't), or
  2. Some of the other entries of final_dists are zero (which they shouldn't be, because vocab_dists and attn_dists are both outputs of softmax functions, and they are combined using p_gen, which is the output of a sigmoid). Perhaps this is due to an underflow problem.

I think the second one seems more likely. I can try to investigate the problem but it's tricky because sometimes you need to run for hours before you can replicate the error.
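For reference, a minimal TF1 sketch of what that tf.gather_nd line does at one decoder step (shapes and names here are illustrative, not the repo's exact variables):

import tensorflow as tf

batch_size, extended_vsize = 4, 10
log_dist = tf.log(tf.nn.softmax(tf.random_normal([batch_size, extended_vsize])))
targets = tf.constant([1, 3, 0, 7])  # gold word id for each example at this decoder step

indices = tf.stack([tf.range(batch_size), targets], axis=1)  # shape (batch_size, 2)
losses = tf.gather_nd(-log_dist, indices)  # -log P(correct word), shape (batch_size,)

with tf.Session() as sess:
    print(sess.run(losses))  # finite, as long as the gathered entries are non-zero probabilities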

@abisee If fast replication is desired, I recommend training with extremely short sequence pairs, such as 10-2; NaN should occur when the training loss reaches 3.
I also find it concerning that the epsilon-added version seems to easily get stuck at a training loss around 3.

This is where the concurrent eval job is useful: it saves the 3 best checkpoints (according to dev set) at any time. So in theory it should never save a NaN model. This is what we used to recover from NaN problems.

@abisee Thanks. But can I use the saved checkpoints from eval to continue training after NaN occurred? I removed everything in log_root/train, copied all the necessary files from log_root/eval to log_root/train, and adjusted the filenames and the contents of the checkpoint file. Now I get an error:

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for log_root/cnndm/train/model.ckpt-78447
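For reference, rather than editing the checkpoint file by hand, TF1's tf.train.update_checkpoint_state can rewrite it; the path below is illustrative, and the matching .index/.data files for that step still have to exist in the new directory:

import tensorflow as tf

train_dir = "log_root/cnndm/train"   # illustrative path
# Rewrites train_dir/checkpoint so model_checkpoint_path points at the copied checkpoint.
tf.train.update_checkpoint_state(
    train_dir, model_checkpoint_path=train_dir + "/model.ckpt-78447")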

I've looked further into this and still don't understand where the NaNs are coming from. I changed the code to detect when a NaN occurs, then dump the attention distribution, vocabulary distribution, final distribution and some other stuff to file.

Looking at the dump file, I find that attn_dists and vocab_dists are both all NaN, on every decoder step, for every example in the batch and across the encoder timesteps (for attn_dists) and across the vocabulary (for vocab_dists). Consequently final_dists contains NaNs, therefore log_dists does and the final loss is NaN too.

This is different than what I was expecting. I was expecting to find zero values in final_dists and therefore NaNs in log_dists, but it seems that the problem occurs earlier, somehow causing NaNs everywhere in attn_dists and vocab_dists.

Given this information, I don't see why adding epsilon to final_dists works as a solution: if attn_dists and vocab_dists already contain NaNs, adding epsilon downstream shouldn't help.
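For anyone trying to localize this kind of failure, one option is to wrap suspect tensors in tf.check_numerics so the run fails at the first non-finite value instead of silently propagating NaNs to the loss (toy TF1 sketch, not the repo's debug code):

import tensorflow as tf

attn_dist = tf.nn.softmax(tf.random_normal([16, 400]))
attn_dist = tf.check_numerics(attn_dist, "NaN or Inf in attn_dist")
vocab_dist = tf.nn.softmax(tf.random_normal([16, 50000]))
vocab_dist = tf.check_numerics(vocab_dist, "NaN or Inf in vocab_dist")

with tf.Session() as sess:
    sess.run([attn_dist, vocab_dist])  # raises InvalidArgumentError if anything is non-finite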

Thanks for your update! The strange thing is, it does work @abisee. Since adding epsilon I have not encountered NaNs again. But it does affect training convergence a bit; by how much, I dunno.

@bluemonk482 I tried adding epsilon as mentioned in the previous discussions above. I still get the NaN loss after one day of training. Can you tell me what was the learning rate you used for your experiments? I tried using 0.1.

@shahbazsyed Well, I used a learning rate as high as 0.15 and had no NaN error after adding epsilon. How did you add the epsilon? Like this (as @Rahul-Iisc suggested):

      def add_epsilon(dist, epsilon=sys.float_info.epsilon):
        epsilon_mask = tf.ones_like(dist) * epsilon
        return dist + epsilon_mask

      final_dists = [add_epsilon(dist) for dist in final_dists]

@bluemonk482 Yes, I added epsilon just as @Rahul-Iisc suggested. I tried with a higher learning rate of 0.2, but am still getting the same error. Did you change any other parameters? Mine are the following:

hidden_dim = 300
emb_dim = 256
coverage = true
lr = 0.2

@shahbazsyed I see you have changed the default parameter settings; I used 128 for emb_dim. And why did you increase your learning rate to 0.2 when you had a NaN error at 0.1? You could try a smaller learning rate.

@bluemonk482 Still got the NaN with a lower learning rate, after 42000 steps. I tried decoding with the model trained so far; the output was just a bunch of UNKs.

Use tfdbg.
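For anyone who hasn't used tfdbg: the usual TF1 recipe is to wrap the session and add the has_inf_or_nan filter (sketch below; the plain tf.Session() is just for illustration):

import tensorflow as tf
from tensorflow.python import debug as tf_debug

sess = tf.Session()                                  # however the session is normally created
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
# In the debugger CLI, `run -f has_inf_or_nan` stops at the first op that produces NaN/Inf.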

Hello everyone, and thanks for your patience. We've made a few changes that help with the NaN issue.

  • We changed the way the log of the final distribution is calculated. This seemed to result in many fewer NaNs (but we still encountered some).
  • We provide a script to let you directly inspect the checkpoint file, to see whether it's corrupted by NaNs or not.
  • New flag to allow you to restore a best model from the eval directory.
  • Train job now halts when it encounters non-finite loss (see the sketch below).
  • The train job now keeps 3 checkpoints at a time -- useful for restoring after NaN.
  • New flag to run Tensorflow Debugger.

The README also contains a section about NaNs.

Edit: We've also provided a pretrained model, which we trained using this new version of the code. See the README.
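As a sketch of the "halts on non-finite loss" behavior (the idea, not necessarily the exact code in the repo):

import numpy as np

def assert_finite_loss(loss, step):
    # Halt training as soon as the loss goes NaN/Inf, so no corrupted checkpoint gets written.
    if not np.isfinite(loss):
        raise Exception("Loss is not finite at step %d. Stopping." % step)

assert_finite_loss(3.14, 1000)            # fine
# assert_finite_loss(float('nan'), 1001)  # would raise and halt the train job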

eduOS commented

I encountered the same problem and found that it is caused by all-zero examples in the batch. In single-pass mode, if the training corpus contains $k * batch_size - a$ examples (with $0 < a < batch_size$), the last batch gets padded with $a$ all-zero examples.

I just added a line after line 329 in batcher.py:

if len(b) != self._hps.batch_size:
    continue

Hope this helps a little.
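A toy illustration of the arithmetic (plain Python, made-up numbers):

batch_size = 16
num_examples = 100                      # made-up corpus size, not a multiple of batch_size

full_batches, leftover = divmod(num_examples, batch_size)
print(full_batches, leftover)           # 6 full batches, 4 leftover examples

# In single-pass mode the incomplete last batch gets padded with empty examples;
# those all-zero examples produce zero probabilities and hence NaN in the loss.
# Skipping incomplete batches, as in the check above, avoids this at the cost of a few examples.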

There might be another cause for NaNs. Some stories in the dataset contain only highlights and no article body. That produces a sequence of [PAD] tokens with an all-zero attention distribution, which makes the probability of [PAD] zero in the final distribution (because of the scatter function). Since the only target for that sequence is [PAD], which has zero probability, the loss becomes NaN. I removed those articles and it seems to be working.
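A hypothetical preprocessing filter for this case (not the repo's code) would drop pairs whose article body is empty before writing the training examples:

def keep_example(article, abstract):
    # Drop stories that have highlights but no article body, so the encoder
    # never sees an input that is nothing but [PAD] tokens.
    return len(article.strip()) > 0 and len(abstract.strip()) > 0

pairs = [("some article text ...", "a highlight"),
         ("", "a highlight whose article body is missing")]
filtered = [p for p in pairs if keep_example(*p)]
print(len(filtered))  # 1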

ygorg commented

I also encounter NaNs when p_gen is 1 or 0. In that case either the vocabulary or the attention contribution is multiplied by zero, so final_dists will have 0s in it (for words that appear only in the vocabulary or only in the input). This causes the loss to be inf (and backprop then puts NaNs in the layers) whenever a word in the reference has 0 probability.
I add epsilon to p_gen when it is 0 and subtract epsilon when it is 1.
But like @tianjianjiang, I am concerned that this will bias the model.
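A minimal TF1 sketch of that clamping idea (epsilon value and names are illustrative):

import tensorflow as tf

eps = 1e-6
p_gen = tf.constant([[0.0], [1.0], [0.37]])  # toy p_gen values
# Keep p_gen strictly inside (0, 1) so neither the vocabulary distribution
# nor the copy (attention) distribution gets multiplied down to exactly zero.
p_gen = tf.clip_by_value(p_gen, eps, 1.0 - eps)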

When I add this model to a Transformer, I get NaN during training. I have tried many ways to solve this problem, but it is still NaN. If anyone is interested, we can discuss.


I'm trying your pointer-generator code built on the LCSTS dataset. I swapped in an Amazon reviews dataset for my academic work, and I'm getting all UNKs in the output. Please help if you have solved this.

Anyone got NaN?
(screenshot: selection_059, loss plot)

Can you please explain how you resolved this, and how you produced this loss graph?