deepmodeling/deepmd-kit

[BUG] FP32 model predicts `nan` energy

Closed this issue · 2 comments

Bug summary

Hi, I trained a model using FP32 precision and it predicts NaN for the energy. After switching to FP64, the model works fine.

DeePMD-kit Version

2.2.7

Backend and its version

TensorFlow v.2.14.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

FP32 Training Input:

{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": "auto",
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "seed": 3608149752,
            "precision": "float32",
            "_activation_function": "tanh"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "precision": "float32",
            "seed": 4147182387
        }
    }
}

FP64 Training Input:

{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": "auto",
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "seed": 3608149752,
            "_activation_function": "tanh"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "seed": 4147182387
        }
    }
}

I have checked with vimdiff; the only difference between these two inputs is the precision parameter.
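As a cross-check of the vimdiff comparison, the two inputs can also be compared key by key. The following is only a minimal sketch: `diff_configs` is a hypothetical helper (not part of DeePMD-kit), and the two dictionaries are trimmed copies of the inputs above that keep only a few keys for brevity.

```python
import json

_MISSING = "<missing>"  # placeholder for a key that exists on one side only


def diff_configs(a, b, path=""):
    """Return (dotted_path, value_in_a, value_in_b) for every differing leaf."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            sub_path = f"{path}.{key}" if path else key
            diffs.extend(
                diff_configs(a.get(key, _MISSING), b.get(key, _MISSING), sub_path)
            )
        return diffs
    return [] if a == b else [(path, a, b)]


# Trimmed versions of the two training inputs (assumption: remaining keys are identical).
fp32 = json.loads("""
{"model": {"descriptor": {"type": "se_e2_a", "rcut": 6.0, "precision": "float32"},
           "fitting_net": {"resnet_dt": true, "precision": "float32"}}}
""")
fp64 = json.loads("""
{"model": {"descriptor": {"type": "se_e2_a", "rcut": 6.0},
           "fitting_net": {"resnet_dt": true}}}
""")

for dotted_path, left, right in diff_configs(fp32, fp64):
    print(f"{dotted_path}: {left!r} vs {right!r}")
```

With the full files, the dictionaries would instead be loaded with `json.load` from the two input files; the only paths reported should be the two `precision` entries.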

Running commands:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH -q normal
source /shared_apps/deepmd-kit/2.2.7/bin/activate /shared_apps/deepmd-kit/2.2.7
dp train input.json
dp freeze -o graph.pb

The script used for prediction:

from pathlib import Path

from ase.io import read
from deepmd.calculator import DP

# load the frozen model as an ASE calculator
dp = DP(model='graph.pb')

# find all the VASP files (*.vasp) recursively
p = Path('.')
vasp_files = list(p.glob('**/*.vasp'))

# read each structure and predict its energy
for vasp_file in vasp_files:
    atoms = read(vasp_file)
    atoms.calc = dp
    print(atoms.get_potential_energy())

For the FP64 model, I got the following results:

-563605.0009180764
-563605.0445137939
...

But for the FP32 model, I got the following results:

nan
nan
...
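When many structures are evaluated, it can help to flag non-finite predictions automatically instead of reading the printed values. The sketch below uses a hypothetical helper, `split_finite`, and assumes the energies from the prediction loop have been collected into a list; the sample list reuses one FP64 value from above together with a NaN.

```python
import math


def split_finite(energies):
    """Separate finite energies from the indices of NaN/inf predictions."""
    finite, bad_indices = [], []
    for index, energy in enumerate(energies):
        if math.isfinite(energy):
            finite.append(energy)
        else:
            bad_indices.append(index)
    return finite, bad_indices


# Illustrative input: one valid energy and one NaN prediction.
sample = [-563605.0009180764, float("nan")]
finite, bad_indices = split_finite(sample)
print(f"{len(finite)} finite, non-finite at indices {bad_indices}")
```

The returned indices can be mapped back to the corresponding entries of `vasp_files` to see which structures produced `nan`.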

Steps to Reproduce

Train a model with FP32 precision, then use this model to predict the energy.

Further Information, Files, and Links

No response

There are known issues in v2.2.7. See #2866. Do you get nan in other versions?

Hi, I trained a new model using DeePMD-kit 3.0.0b4, and the NaN problem is fixed. Thank you!