mir-group/allegro

Stuck in initialization

vaidishs opened this issue · 2 comments

I'm trying to train a simple model with ~500 structures, but training is stuck on the 0th iteration.

Output:

Torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[52369, 1], batch=[52369], cell=[582, 3, 3], edge_cell_shift=[4453866, 3], edge_index=[2, 4453866], forces=[52369, 3], pbc=[582, 3], pos=[52369, 3], ptr=[583], total_energy=[582, 1])
    processed data size: ~120.96 MB
Cached processed data to disk
Done!
Successfully loaded the data set of type ASEDataset(582)...
Replace string dataset_forces_rms to 0.7190595865249634
Replace string dataset_per_atom_total_energy_mean to -8.752270698547363
Atomic outputs are scaled by: [O, Ti: 0.719060], shifted by [O, Ti: -8.752271].
Replace string dataset_forces_rms to 0.7190595865249634
Initially outputs are globally scaled by: 0.7190595865249634, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 10987400
Number of trainable weights: 10987400
! Starting training ...

validation
# Epoch batch         loss       loss_f       loss_e       f_rmse     e/N_rmse

nequip-train just stays stuck here for a long time.

Can you share the input file?

Figured it out last night: it was the Torch version. 1.12 doesn't work; 1.11 does.
Closing this. Thanks @simonbatzner for a quick response.
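
For anyone hitting the same hang, here is a minimal sketch of a pre-flight check to run before nequip-train. It only encodes what was reported in this thread (1.12 hung, 1.11 worked) and is not an official compatibility list. The helper names `torch_major_minor` and `is_reported_problem_version` are hypothetical, made up for this example.

```python
import re
from typing import Tuple


def torch_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.12.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_reported_problem_version(version: str) -> bool:
    """True only for the 1.12 series, the one reported as hanging in this thread.

    Assumption: other versions are not flagged simply because the thread
    says nothing about them, not because they are known to work.
    """
    return torch_major_minor(version) == (1, 12)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to inspect the local install
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        if is_reported_problem_version(torch.__version__):
            print(f"Torch {torch.__version__}: 1.12 was reported to hang at the start of training.")
        else:
            print(f"Torch {torch.__version__}: not the version reported as problematic here.")
```

If the check flags the install, the fix that worked in this thread was moving to Torch 1.11.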