Stuck in initialization
vaidishs opened this issue · 2 comments
vaidishs commented
I'm trying to train a simple model with ~500 structures, but training is stuck on the 0th iteration.
Output:
Torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[52369, 1], batch=[52369], cell=[582, 3, 3], edge_cell_shift=[4453866, 3], edge_index=[2, 4453866], forces=[52369, 3], pbc=[582, 3], pos=[52369, 3], ptr=[583], total_energy=[582, 1])
processed data size: ~120.96 MB
Cached processed data to disk
Done!
Successfully loaded the data set of type ASEDataset(582)...
Replace string dataset_forces_rms to 0.7190595865249634
Replace string dataset_per_atom_total_energy_mean to -8.752270698547363
Atomic outputs are scaled by: [O, Ti: 0.719060], shifted by [O, Ti: -8.752271].
Replace string dataset_forces_rms to 0.7190595865249634
Initially outputs are globally scaled by: 0.7190595865249634, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 10987400
Number of trainable weights: 10987400
! Starting training ...
validation
# Epoch batch loss loss_f loss_e f_rmse e/N_rmse
nequip-train then just stays stuck here for a long time.
simonbatzner commented
Can you share the input file?
vaidishs commented
Figured it out last night: it was the Torch version. 1.12 doesn't work, 1.11 does.
Closing this. Thanks @simonbatzner for a quick response.
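
For anyone who hits the same hang, here is a minimal sketch of a check for whether the installed PyTorch matches the release reported to stall in this thread. The helper names (`parse_major_minor`, `matches_reported_hang`, `installed_torch_version`) are hypothetical and not part of nequip or PyTorch. The only version information encoded is what this thread reports (1.12 hung, 1.11 worked), so treat it as a quick diagnostic rather than a compatibility table.

```python
import re
from typing import Optional, Tuple

# Assumption: the only data point is this thread's report that training
# hung on torch 1.12 and ran normally on torch 1.11.
REPORTED_HANGING = (1, 12)
REPORTED_WORKING = (1, 11)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.12.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def matches_reported_hang(version: str) -> bool:
    """True if the version has the major.minor reported to hang in this thread."""
    return parse_major_minor(version) == REPORTED_HANGING


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return str(torch.__version__)


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("PyTorch is not installed in this environment.")
    elif matches_reported_hang(version):
        print(f"torch {version}: matches the release reported to hang here; "
              f"{REPORTED_WORKING[0]}.{REPORTED_WORKING[1]} was reported to work.")
    else:
        print(f"torch {version}: not the release reported to hang in this thread.")
```

Running this in the same environment as `nequip-train` before starting a long job shows whether the environment matches the case described above.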