[BUG] _FP32 model predicts `nan` energy_

Question

Closed this issue 2 months ago · 2 comments

Bug summary

Hi, I trained a model using FP32 precision, and it predicts NaN for the energy. When I switch to FP64, the model works fine.

DeePMD-kit Version

2.2.7

Backend and its version

TensorFlow v.2.14.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

FP32 Training Input:

{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": "auto",
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "seed": 3608149752,
            "precision": "float32",
            "_activation_function": "tanh"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "precision": "float32",
            "seed": 4147182387
        }
    }
}

FP64 Training Input:

{
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "sel": "auto",
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                25,
                50,
                100
            ],
            "resnet_dt": false,
            "axis_neuron": 12,
            "seed": 3608149752,
            "_activation_function": "tanh"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "seed": 4147182387
        }
    }
}

I have checked with vimdiff; the only difference between these two inputs is the precision parameter.
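
For anyone who wants to repeat this check without vimdiff, the comparison can also be scripted. The following is only a minimal sketch: `diff_json` is a hypothetical helper (not part of DeePMD-kit), and the two dictionaries are trimmed-down stand-ins for the inputs above rather than the full files.

```python
import json
from typing import Any, List


def diff_json(a: Any, b: Any, path: str = "") -> List[str]:
    """Return human-readable differences between two parsed JSON values."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs: List[str] = []
        # Walk the union of keys so entries missing on either side are reported.
        for key in sorted(set(a) | set(b)):
            sub = f"{path}/{key}"
            if key not in a:
                diffs.append(f"{sub}: only in second ({b[key]!r})")
            elif key not in b:
                diffs.append(f"{sub}: only in first ({a[key]!r})")
            else:
                diffs.extend(diff_json(a[key], b[key], sub))
        return diffs
    # Leaves (numbers, strings, lists, booleans) are compared directly.
    if a != b:
        return [f"{path}: {a!r} != {b!r}"]
    return []


# Trimmed-down stand-ins for the FP32 and FP64 inputs (assumption: not the full files).
fp32 = json.loads(
    '{"model": {"descriptor": {"rcut": 6.0, "precision": "float32"},'
    ' "fitting_net": {"resnet_dt": true, "precision": "float32"}}}'
)
fp64 = json.loads(
    '{"model": {"descriptor": {"rcut": 6.0}, "fitting_net": {"resnet_dt": true}}}'
)

for line in diff_json(fp32, fp64):
    print(line)
```

To compare the real inputs, the two `json.loads(...)` calls could be replaced with `json.load(open(...))` on the actual file names.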

Running commands:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH -q normal
source /shared_apps/deepmd-kit/2.2.7/bin/activate /shared_apps/deepmd-kit/2.2.7
dp train input.json
dp freeze -o graph.pb

The script used for prediction:

from pathlib import Path

from ase.io import read
from deepmd.calculator import DP

# load the frozen model as an ASE calculator
dp = DP(model='graph.pb')

# find all the VASP files (*.vasp), searching recursively
p = Path('.')
vasp_files = list(p.glob('**/*.vasp'))

# read each structure and predict its energy
for vasp_file in vasp_files:
    atoms = read(vasp_file)
    atoms.calc = dp
    print(atoms.get_potential_energy())

For the FP64 model, I got the following results:

-563605.0009180764
-563605.0445137939
...

But for FP32 model, I got the following results:

nan
nan
...
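
To see whether every structure or only some of them yield NaN, the predictions can be collected and filtered. This is a rough sketch only: `non_finite_entries` is a hypothetical helper, and the energies below are placeholder values standing in for what `atoms.get_potential_energy()` returns in the script above.

```python
import math
from typing import Dict, List


def non_finite_entries(energies: Dict[str, float]) -> List[str]:
    """Return the names whose predicted energy is NaN or infinite."""
    return [name for name, e in energies.items() if not math.isfinite(e)]


# Placeholder results (assumption); in practice, fill this dict inside the
# prediction loop, keyed by file name.
energies = {"a.vasp": -563605.0009180764, "b.vasp": float("nan")}
print(non_finite_entries(energies))
```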

Steps to Reproduce

Train a model with FP32 precision, then use that model to predict the energy.

Further Information, Files, and Links

No response

Answer 1 · 2024-10-10T19:08:44.000Z

There are known issues in v2.2.7. See #2866. Do you get nan in other versions?

Answer 2 · 2024-10-12T04:01:54.000Z

Hi, I trained a new model using DeePMD-kit 3.0.0b4, and the NaN problem is fixed. Thank you!