Suspected optimizer issue causing crashes
kinoc opened this issue · 1 comment
kinoc commented
I am running the code on a single box with a TITAN and a 2080 Ti. The trainer runs only on the TITAN. The problem is that the system locks up the CPU and kills the local network. Performance is very poor.
It seems to be related to microsoft/DeepSpeed#679
Changing the optimizer section of the JSON file seems to allow it to run. It is a bit slower, but it does run.
"optimizer": {
  "type": "Adam",
  "params": {
    "torch_adam": true,
    "lr": 0.00001,
    "betas": [
      0.9,
      0.95
    ],
    "eps": 1e-8,
    "weight_decay": 0.1
  }
}
The key change is setting "torch_adam" to true.
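For anyone who wants to apply the same workaround without editing the file by hand, here is a minimal sketch using only the Python standard library. The helper names `with_torch_adam` and `patch_config_file` and the example path are hypothetical, not part of DeepSpeed; the only thing taken from the config above is the "torch_adam" key under "optimizer" → "params".

```python
import copy
import json
from pathlib import Path


def with_torch_adam(config):
    """Return a copy of a config dict with optimizer.params.torch_adam set to True.

    Hypothetical helper: it assumes the config already has an "optimizer"
    section shaped like the one posted above and leaves every other key alone.
    """
    patched = copy.deepcopy(config)      # do not mutate the caller's dict
    optimizer = patched["optimizer"]     # raises KeyError if the section is missing
    params = optimizer.setdefault("params", {})
    params["torch_adam"] = True
    return patched


def patch_config_file(path):
    """Rewrite the JSON config at `path` in place with torch_adam enabled."""
    path = Path(path)
    config = json.loads(path.read_text())
    path.write_text(json.dumps(with_torch_adam(config), indent=2) + "\n")


if __name__ == "__main__":
    example = {
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 0.00001, "betas": [0.9, 0.95], "eps": 1e-8, "weight_decay": 0.1},
        }
    }
    print(json.dumps(with_torch_adam(example), indent=2))
```

To patch a file on disk, call `patch_config_file("path/to/your_config.json")` with your own config path (the path here is a placeholder).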
Any ideas for regaining regular performance would be appreciated.
CharanSG commented
Installing "cudatoolkit-dev" solved the issue without touching the config.
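A quick way to see whether a CUDA compiler is visible in the current environment, before and after installing "cudatoolkit-dev", is sketched below. This rests on an assumption not confirmed in this thread: that the package helps because it makes `nvcc` available. The function name `cuda_compiler_status` is hypothetical, and the check only looks at `PATH` and the `CUDA_HOME` environment variable.

```python
import os
import shutil


def cuda_compiler_status():
    """Report where nvcc is found on PATH and what CUDA_HOME is set to.

    Either value is None when nothing is found. This inspects the environment
    only; it does not verify that the toolkit version matches PyTorch's.
    """
    return {
        "nvcc_path": shutil.which("nvcc"),
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
    }


if __name__ == "__main__":
    for name, value in cuda_compiler_status().items():
        print(f"{name}: {value or 'not found'}")
```

If both values come back empty before the install and `nvcc_path` is populated afterwards, that is at least consistent with the fix described here.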