salesforce/decaNLP

NaN loss and only OOV in the greedy output

debajyotidatta opened this issue · 2 comments

The loss was initially decreasing, but then it turned to NaN and stayed there. I am running on the SQuAD dataset, and the exact command I used is:

python train.py --train_tasks squad --device 0 --data ./.data --save ./results/ --embeddings ./.embeddings/ --train_batch_tokens 2000

So the only change is lowering --train_batch_tokens to 2000, since my GPU was running out of memory. I am attaching a screenshot. Is there anything I am missing? Should I try something else?

[screenshot: training output, 2018-11-02 14:35:47]
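For context, this is the kind of NaN guard I was thinking of adding around the update step so a bad batch gets skipped instead of poisoning the weights. It is only a minimal sketch assuming a standard PyTorch loop; the model(batch) call returning a scalar loss and the grad_clip value are placeholders, not decaNLP's actual train.py:

```python
import math

import torch


def training_step(model, batch, optimizer, grad_clip=1.0):
    """Minimal sketch of a NaN-guarded update (assumes model(batch) returns a scalar loss)."""
    optimizer.zero_grad()
    loss = model(batch)
    if not math.isfinite(loss.item()):
        # Skip the update so one bad batch does not corrupt the weights.
        print("non-finite loss detected, skipping batch:", loss.item())
        return None
    loss.backward()
    # Clip gradients; large token batches can otherwise blow up the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    return loss.item()
```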

Well that's no good. Let me try running your exact command on my side to see if I get the same thing. Do you know which iteration this first started on? Is it 438000?


I hit the same problem when I ran
nvidia-docker run -it --rm -v $(pwd):/decaNLP/ -u $(id -u):$(id -g) bmccann/decanlp:cuda9_torch041 bash -c "python /decaNLP/train.py --train_tasks squad --device 0"
It started at iteration_316800.
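To pin down exactly which iteration the NaNs first appear at, a small log-scanning helper like the sketch below could work. The iteration_NNN / loss line format it expects is an assumption, so the regex would need to be adjusted to match the actual training output:

```python
import math
import re

# Hypothetical helper: find the first iteration whose reported loss is NaN.
# The "iteration_NNN ... loss X" line format is an assumption about the log.
LINE_RE = re.compile(r"iteration_(\d+).*?loss[:= ]+(\S+)", re.IGNORECASE)


def first_nan_iteration(log_path):
    with open(log_path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            iteration, loss_str = int(m.group(1)), m.group(2)
            try:
                loss = float(loss_str)
            except ValueError:
                continue
            if math.isnan(loss):
                return iteration
    return None


print(first_nan_iteration("train.log"))  # would print e.g. 316800 if that is where NaNs begin
```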