[BUG] _FP32 model predicts `nan` energy_
Closed this issue · 2 comments
MoseyQAQ commented
Bug summary
Hi, I trained a model using `FP32` precision and it's predicting `NaN` for energy. After switching to `FP64`, the model works fine.
DeePMD-kit Version
2.2.7
Backend and its version
TensorFlow v.2.14.0
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
FP32 Training Input:
{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": "auto",
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "seed": 3608149752,
            "precision": "float32",
            "_activation_function": "tanh"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "precision": "float32",
            "seed": 4147182387
        }
    }
}
FP64 Training Input:
{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": "auto",
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "seed": 3608149752,
            "_activation_function": "tanh"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "seed": 4147182387
        }
    }
}
I have checked with `vimdiff`; the only difference between these two inputs is the `precision` parameter.
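For anyone who wants to confirm this without `vimdiff`, here is a minimal sketch of a recursive comparison of two parsed JSON inputs. The helper `diff_configs`, the `MISSING` marker, and the trimmed-down `fp32`/`fp64` dictionaries are made up for illustration; they are not part of DeePMD-kit.

```python
MISSING = "<missing>"  # hypothetical marker for a key present on only one side


def diff_configs(a, b, path=""):
    """Return a list of (path, value_in_a, value_in_b) where two nested dicts differ."""
    diffs = []
    if isinstance(a, dict) and isinstance(b, dict):
        # Walk the union of keys in sorted order so the output is deterministic.
        for key in sorted(set(a) | set(b)):
            child = f"{path}/{key}"
            diffs.extend(diff_configs(a.get(key, MISSING), b.get(key, MISSING), child))
    elif a != b:
        diffs.append((path, a, b))
    return diffs


# Trimmed-down fragments of the two inputs above (illustration only).
fp32 = {
    "model": {
        "descriptor": {"rcut": 6.0, "precision": "float32"},
        "fitting_net": {"resnet_dt": True, "precision": "float32"},
    }
}
fp64 = {
    "model": {
        "descriptor": {"rcut": 6.0},
        "fitting_net": {"resnet_dt": True},
    }
}

for where, left, right in diff_configs(fp32, fp64):
    print(where, left, right)
```

With the real files, the two dictionaries would come from `json.load` on each input file instead of being written inline.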
Running commands:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH -q normal
source /shared_apps/deepmd-kit/2.2.7/bin/activate /shared_apps/deepmd-kit/2.2.7
dp train input.json
dp freeze -o graph.pb
The script used for prediction:
from ase.io import read
from glob import glob
from pathlib import Path
from deepmd.calculator import DP
dp = DP(model='graph.pb')
# find all the VASP files (*.vasp) recursively
p = Path('.')
vasp_files = list(p.glob('**/*.vasp'))
# read each structure and predict its energy
for i in vasp_files:
    atoms = read(i)
    atoms.calc = dp
    print(atoms.get_potential_energy())
For the `FP64` model, I got the following results:
-563605.0009180764
-563605.0445137939
...
But for the `FP32` model, I got the following results:
nan
nan
...
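To see which structures are affected rather than reading through printed values, the energies could be collected and split into finite and non-finite results. This is only a sketch: the helper `split_finite` and the file names in the example are hypothetical, and in the script above the values would come from `atoms.get_potential_energy()`.

```python
import math


def split_finite(energies):
    """Split a {name: energy} mapping into (finite, non_finite) dicts."""
    finite, non_finite = {}, {}
    for name, energy in energies.items():
        # math.isfinite is False for both nan and +/-inf.
        target = finite if math.isfinite(energy) else non_finite
        target[name] = energy
    return finite, non_finite


# Hypothetical file names; the finite value is taken from the FP64 output above.
energies = {
    "a.vasp": -563605.0009180764,
    "b.vasp": float("nan"),
}
finite, non_finite = split_finite(energies)
print("finite:", sorted(finite))
print("non-finite:", sorted(non_finite))
```

Comparing against `nan` with `==` does not work because `nan` is not equal to itself, which is why the check goes through `math.isfinite`.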
Steps to Reproduce
Train a model with FP32 precision, then use that model to predict energy.
Further Information, Files, and Links
No response
MoseyQAQ commented
Hi, I trained a new model using `deepmd3.0.0b4`, and the `nan` problem is fixed. Thank you!