Training large models using train.py
I tried to retrain the model on the PDBBind dataset. I ran the train.py script directly without any parameters, and it finished the whole training process very quickly (under 30 minutes). I got the following report:
```
Epoch 399: Val inference rmsds_lt2 0.000 rmsds_lt5 0.000 min_rmsds_lt2 0.000 min_rmsds_lt5 0.000
Best Validation Loss 0.5782018661499023 on Epoch 387
Best inference metric 0.0 on Epoch 399
```
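For context, my understanding is that rmsds_lt2 / rmsds_lt5 report the fraction of validation complexes whose top-ranked pose is within 2 Å / 5 Å ligand RMSD of the crystal pose, and min_rmsds_* the same over the best of all sampled poses, so 0.000 would mean not a single pose landed under either threshold. Roughly this (a sketch of my reading, not the repo's actual code):

```python
import numpy as np

# Toy example: ligand heavy-atom RMSDs (in angstroms) of the top-ranked
# pose for five hypothetical validation complexes.
rmsds = np.array([1.4, 3.2, 6.8, 1.9, 4.5])

rmsds_lt2 = float((rmsds < 2.0).mean())  # fraction of complexes under 2 A
rmsds_lt5 = float((rmsds < 5.0).mean())  # fraction under 5 A
print(rmsds_lt2, rmsds_lt5)              # 0.4 0.8
```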
So my first question is about the training result: does this indicate a successful run, and why are all the RMSD metrics reported as 0? I also think the training loss didn't decrease much. For example, this is the training loss near the beginning of training:
```
Epoch 24: Training loss 0.8713 tr 0.2144 rot 1.2553 tor 1.1707 sc 0.0000 lr 0.0010
```
And this is the training loss in the middle of training:
```
Training loss 0.9436 tr 0.5029 rot 1.4023 tor 0.9541 sc 0.0000 lr 0.0010
```
As you can see, the training loss barely changed. Is that normal behavior?
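For reference, this is how I read the loss line: the total appears to combine per-component score-matching losses for translation (tr), rotation (rot), torsion (tor), and side chains (sc). The combination scheme and weights below are purely my guess, not something I found in the code:

```python
def total_loss(tr: float, rot: float, tor: float, sc: float,
               w_tr: float = 1.0, w_rot: float = 1.0,
               w_tor: float = 1.0, w_sc: float = 1.0) -> float:
    """Weighted sum of the per-component losses (weights are assumed;
    the actual train.py scheme may average or weight them differently)."""
    return w_tr * tr + w_rot * rot + w_tor * tor + w_sc * sc
```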
Also, I found that training with train.py's default settings only results in a model that occupies about 3 GB per GPU (across 2 GPUs), so it seems I'm training the small model. What settings should I change to train the large model instead? Do you have a model configuration file for the large model?
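In case it helps, I'm sanity-checking model size via the trainable parameter count rather than per-GPU memory, with a small helper like this (the `model` variable is whatever train.py constructs; the helper is my own):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters, a more direct measure of
    model size than per-GPU memory usage."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage (inside train.py, after the model is built):
# print(f"{count_parameters(model) / 1e6:.1f}M trainable parameters")
```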