ksw0306/ClariNet

KL loss becomes nan

NewEricWang opened this issue · 0 comments

Hi, I used my own data to train the teacher model and the student model. The teacher model trains normally, but the KL loss of the student model becomes NaN at about step 50k. The log output looks like this:
Global Step : 67438, [55, 100] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.301 nan]
100 Step Time : 143.40849208831787
Global Step : 67538, [55, 200] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3145 nan]
100 Step Time : 142.97523188591003
Global Step : 67638, [55, 300] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3899 nan]
100 Step Time : 142.70189571380615
Global Step : 67738, [55, 400] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.2744 nan]
100 Step Time : 142.38775205612183
Global Step : 67838, [55, 500] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.436 nan]
100 Step Time : 142.91834259033203
Global Step : 67938, [55, 600] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3337 nan]
100 Step Time : 142.9723343849182
Global Step : 68038, [55, 700] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.5294 nan]
100 Step Time : 142.89931344985962
Global Step : 68138, [55, 800] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3567 nan]
100 Step Time : 143.14595890045166
Global Step : 68238, [55, 900] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3591 nan]
100 Step Time : 143.44508004188538
Global Step : 68338, [55, 1000] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.2139 nan]
100 Step Time : 143.32597756385803
Global Step : 68438, [55, 1100] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.402 nan]
100 Step Time : 142.90216040611267
Global Step : 68538, [55, 1200] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.4041 nan]
100 Step Time : 142.8380832672119
55 Epoch Training Loss : nan
100 [Total, KL, Reg, Frame Loss] : [ nan nan 2.042 nan]
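
To narrow down where the loss first goes bad, a small log scanner may help. The sketch below is not part of ClariNet; it only assumes the log lines have the same shape as the ones pasted above. The names `LINE_RE`, `NAMES`, `parse_line`, and `first_non_finite` are made up for illustration.

```python
import math
import re

# Assumed line shape, taken from the log above:
# "Global Step : 67438, [55, 100] [Total Loss, ...] : [ nan nan 2.301 nan]"
LINE_RE = re.compile(
    r"Global Step : (\d+), \[(\d+), (\d+)\] \[[^\]]*\] : \[([^\]]*)\]"
)
# Hypothetical short names for the four logged values, in logged order.
NAMES = ("total", "kl", "reg", "frame")


def parse_line(line):
    """Return (global_step, {name: value}) or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    values = [float(v) for v in m.group(4).split()]  # float("nan") parses fine
    if len(values) != len(NAMES):
        return None
    return int(m.group(1)), dict(zip(NAMES, values))


def first_non_finite(lines):
    """Return (global_step, [bad loss names]) for the first NaN/inf line, else None."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # e.g. "100 Step Time : ..." or the epoch summary lines
        step, losses = parsed
        bad = [name for name, v in losses.items() if not math.isfinite(v)]
        if bad:
            return step, bad
    return None


if __name__ == "__main__":
    sample = (
        "Global Step : 67438, [55, 100] "
        "[Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.301 nan]"
    )
    print(first_non_finite([sample]))  # (67438, ['total', 'kl', 'frame'])
```

Running it over the full training log (for example `first_non_finite(open(path))`, where `path` is wherever the log was saved) should report the first global step at which any of the four values stops being finite, which is the checkpoint range worth inspecting.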