[BUG] dipole model (pt backend) does not work for systems with different number of atoms
Closed this issue · 3 comments
Bug summary
When training a dipole model on multiple systems with different numbers of atoms, an error is thrown when arrays of different shapes are concatenated. Training with a single system, or with multiple systems that have the same number of atoms, works fine.
DeePMD-kit Version
DeePMD-kit v3.0.0b5.dev52+g13e247ec
Backend and its version
PyTorch v2.2.1-g6c8c5ad5eaf
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
Input file (which is adapted from the official example of the water dipole model):
{
"_comment1": " model parameters",
"model": {
"type_map": ["O", "H"],
"atom_exclude_types": [1],
"descriptor": {
"type": "se_e2_a",
"sel": [46, 92],
"rcut_smth": 3.8,
"rcut": 4.0,
"neuron": [25, 50, 100],
"resnet_dt": false,
"axis_neuron": 6,
"type_one_side": true,
"precision": "float64",
"seed": 1,
"_comment2": " that's all"
},
"fitting_net": {
"type": "dipole",
"neuron": [100, 100, 100],
"resnet_dt": true,
"precision": "float64",
"seed": 1,
"_comment3": " that's all"
},
"_comment4": " that's all"
},
"learning_rate": {
"type": "exp",
"start_lr": 0.01,
"decay_steps": 5000,
"_comment5": "that's all"
},
"loss": {
"type": "tensor",
"pref": 0.0,
"pref_atomic": 1.0,
"_comment6": " that's all"
},
"_comment7": " traing controls",
"training": {
"training_data": {
"systems": "./test-data",
"batch_size": "auto",
"_comment8": "that's all"
},
"numb_steps": 2000,
"seed": 10,
"disp_file": "lcurve.out",
"disp_freq": 100,
"save_freq": 1000,
"_comment10": "that's all"
},
"_comment11": "that's all"
}
Running command:
dp --pt train input.json
Error log:
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-11-16 14:35:09,752] DEEPMD INFO DeePMD version: 3.0.0b5.dev52+g13e247ec
[2024-11-16 14:35:09,752] DEEPMD INFO Configuration path: mix_data.json
[2024-11-16 14:35:09,763] DEEPMD INFO _____ _____ __ __ _____ _ _ _
[2024-11-16 14:35:09,763] DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[2024-11-16 14:35:09,763] DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[2024-11-16 14:35:09,763] DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[2024-11-16 14:35:09,763] DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[2024-11-16 14:35:09,763] DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[2024-11-16 14:35:09,763] DEEPMD INFO Please read and cite:
[2024-11-16 14:35:09,763] DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2024-11-16 14:35:09,763] DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2024-11-16 14:35:09,763] DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
[2024-11-16 14:35:09,763] DEEPMD INFO --------------------------------------------------------------------------------------------------------
[2024-11-16 14:35:09,763] DEEPMD INFO installed to: /home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.10/site-packages/deepmd
[2024-11-16 14:35:09,763] DEEPMD INFO /home/jxzhu/apps/deepmd/devel/deepmd
[2024-11-16 14:35:09,763] DEEPMD INFO source: v3.0.0b4-52-g13e247ec
[2024-11-16 14:35:09,763] DEEPMD INFO source branch: devel
[2024-11-16 14:35:09,763] DEEPMD INFO source commit: 13e247ec
[2024-11-16 14:35:09,763] DEEPMD INFO source commit at: 2024-10-26 18:25:18 +0000
[2024-11-16 14:35:09,763] DEEPMD INFO use float prec: double
[2024-11-16 14:35:09,763] DEEPMD INFO build variant: cuda
[2024-11-16 14:35:09,763] DEEPMD INFO Backend: PyTorch
[2024-11-16 14:35:09,763] DEEPMD INFO PT ver: v2.2.1-g6c8c5ad5eaf
[2024-11-16 14:35:09,763] DEEPMD INFO Enable custom OP: False
[2024-11-16 14:35:09,763] DEEPMD INFO running on: jxzhu
[2024-11-16 14:35:09,763] DEEPMD INFO computing device: cuda:0
[2024-11-16 14:35:09,763] DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
[2024-11-16 14:35:09,763] DEEPMD INFO Count of visible GPUs: 1
[2024-11-16 14:35:09,763] DEEPMD INFO num_intra_threads: 0
[2024-11-16 14:35:09,763] DEEPMD INFO num_inter_threads: 0
[2024-11-16 14:35:09,763] DEEPMD INFO --------------------------------------------------------------------------------------------------------
[2024-11-16 14:35:09,803] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2024-11-16 14:35:10,054] DEEPMD INFO Adjust batch size from 1024 to 2048
[2024-11-16 14:35:10,145] DEEPMD INFO Adjust batch size from 2048 to 4096
[2024-11-16 14:35:10,232] DEEPMD INFO Adjust batch size from 4096 to 8192
[2024-11-16 14:35:10,444] DEEPMD INFO Adjust batch size from 8192 to 16384
[2024-11-16 14:35:10,653] DEEPMD INFO Adjust batch size from 16384 to 32768
[2024-11-16 14:35:10,866] DEEPMD INFO Adjust batch size from 32768 to 16384
[2024-11-16 14:35:11,127] DEEPMD INFO training data with min nbor dist: 0.999890527057838
[2024-11-16 14:35:11,127] DEEPMD INFO training data with max nbor size: [20 36]
[2024-11-16 14:35:11,146] DEEPMD INFO Packing data for statistics from 3 systems
Traceback (most recent call last):
File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/bin/dp", line 8, in <module>
sys.exit(main())
File "/home/jxzhu/apps/deepmd/devel/deepmd/main.py", line 927, in main
deepmd_main(args)
File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 527, in main
train(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 339, in train
trainer = get_trainer(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 191, in get_trainer
trainer = training.Trainer(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/train/training.py", line 293, in __init__
self.get_sample_func = single_model_stat(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/train/training.py", line 233, in single_model_stat
_model.compute_or_load_stat(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/model/make_model.py", line 573, in compute_or_load_stat
return self.atomic_model.compute_or_load_stat(sampled_func, stat_file_path)
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 294, in compute_or_load_stat
self.compute_or_load_out_stat(wrapped_sampler, stat_file_path)
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/atomic_model/base_atomic_model.py", line 396, in compute_or_load_out_stat
self.change_out_bias(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/atomic_model/base_atomic_model.py", line 463, in change_out_bias
bias_out, std_out = compute_output_stats(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/utils/stat.py", line 367, in compute_output_stats
bias_atom_a, std_atom_a = compute_output_stats_atomic(
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/utils/stat.py", line 550, in compute_output_stats_atomic
merged_output = {
File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/utils/stat.py", line 551, in <dictcomp>
kk: to_numpy_array(torch.cat(outputs[kk]))
File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
return func(*args, **kwargs)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 702 but got size 864 for tensor number 1 in the list.
Steps to Reproduce
mkdir test && cd test
wget -c https://github.com/user-attachments/files/17783779/test-data.tar.gz
tar -zxvf test-data.tar.gz
# get input.json file
dp --pt train input.json
Further Information, Files, and Links
My guess is that at line 550 we are concatenating tensor labels of shape [nframe, nloc * ndim]. This fails because the first system has 234 atoms, each with a dipole label of shape 3 (hence 702), while the second system has 288 atoms (hence 864).
deepmd-kit/deepmd/pt/utils/stat.py
Lines 550 to 554 in 0ad4289
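The shape mismatch can be reproduced in isolation. A minimal sketch using NumPy (which follows the same concatenation rules as `torch.cat`); the frame count of 4 is made up, while 234 and 288 atoms come from the traceback's 702 = 234 * 3 and 864 = 288 * 3:

```python
import numpy as np

# Per-system atomic dipole labels as assumed above:
# one array per system, shaped [nframe, nloc * ndim].
sys_a = np.zeros((4, 234 * 3))  # 234 atoms -> 702 columns
sys_b = np.zeros((4, 288 * 3))  # 288 atoms -> 864 columns

# Concatenating along the frame axis fails because the trailing
# dimensions differ (702 vs 864), mirroring the torch.cat RuntimeError.
try:
    np.concatenate([sys_a, sys_b])
except ValueError as e:
    print(e)
```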
To fix this, I think we can simply reshape the tensor labels into [nframe * nloc, 1, ndim]. I did a quick test, and this seems to work.
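A sketch of the proposed reshape, again illustrated with NumPy under the same assumed shapes (4 frames per system): flattening the frame and atom axes makes the concatenation axis the only one that varies between systems.

```python
import numpy as np

# Same per-system labels as before, shaped [nframe, nloc * ndim].
sys_a = np.zeros((4, 234 * 3))
sys_b = np.zeros((4, 288 * 3))

# Reshape each system to [nframe * nloc, 1, ndim]; now only axis 0
# differs between systems, so concatenation succeeds.
flat_a = sys_a.reshape(-1, 1, 3)  # (936, 1, 3)
flat_b = sys_b.reshape(-1, 1, 3)  # (1152, 1, 3)

merged = np.concatenate([flat_a, flat_b])
print(merged.shape)  # (2088, 1, 3)
```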
will create a PR to fix this soon.
@ChiahsinChu can we add your test data into UT?
> @ChiahsinChu can we add your test data into UT?
Sure.