Stuck in initialization

Question

vaidishs opened this issue 2 years ago · 2 comments

I'm trying to train a small model with ~500 structures, but training is stuck on the 0th iteration.

Output:

Torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[52369, 1], batch=[52369], cell=[582, 3, 3], edge_cell_shift=[4453866, 3], edge_index=[2, 4453866], forces=[52369, 3], pbc=[582, 3], pos=[52369, 3], ptr=[583], total_energy=[582, 1])
    processed data size: ~120.96 MB
Cached processed data to disk
Done!
Successfully loaded the data set of type ASEDataset(582)...
Replace string dataset_forces_rms to 0.7190595865249634
Replace string dataset_per_atom_total_energy_mean to -8.752270698547363
Atomic outputs are scaled by: [O, Ti: 0.719060], shifted by [O, Ti: -8.752271].
Replace string dataset_forces_rms to 0.7190595865249634
Initially outputs are globally scaled by: 0.7190595865249634, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 10987400
Number of trainable weights: 10987400
! Starting training ...

validation
# Epoch batch         loss       loss_f       loss_e       f_rmse     e/N_rmse

nequip-train just stays stuck here for a long time.

Answer 1 · 2023-03-31T12:11:33.000Z

Can you share the input file?

Answer 2 · 2023-03-31T14:26:41.000Z

Figured it out last night: it was the Torch version. 1.12 doesn't work, 1.11 does.
Closing this. Thanks @simonbatzner for a quick response.
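
For anyone who lands here with the same hang, a quick way to see which Torch version an environment has before starting a run is sketched below. This is only an illustration based on the report above (1.12 hung, 1.11 worked); the helper names `is_reported_bad_torch` and `installed_torch_version` are hypothetical and not part of nequip or PyTorch, and other versions were not tested in this thread.

```python
from importlib import metadata


def is_reported_bad_torch(version: str) -> bool:
    """Return True if `version` belongs to the 1.12 series reported to hang in this thread."""
    # Drop any local build tag such as "+cu116" before comparing.
    release = version.split("+", 1)[0]
    # Compare only the major and minor components.
    return release.split(".")[:2] == ["1", "12"]


def installed_torch_version():
    """Return the installed torch version string, or None if torch is not installed."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("torch is not installed in this environment")
    elif is_reported_bad_torch(version):
        print(f"torch {version}: 1.12 was reported to hang at the first validation; 1.11 worked")
    else:
        print(f"torch {version}: not the version reported to hang in this thread")
```

Reading the version through `importlib.metadata` avoids importing torch itself, so the check is fast and also works in an environment where torch is missing.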