Xirider/finetune-gpt2xl

Suspected optimizer issue causing crashes

kinoc opened this issue · 1 comment

kinoc commented

I am running the code on a single box with a TITAN and a 2080Ti. The trainer is running only on the TITAN. I have a problem where the system locks up the CPU and kills the local network. Very non-performant ...

It seems to be related to microsoft/DeepSpeed#679.
Changing the optimizer section of the JSON file seems to allow it to run. A bit slower, but it does run.

"optimizer": {
    "type": "Adam",
    "params": {
        "torch_adam": true,
        "lr": 0.00001,
        "betas": [
            0.9,
            0.95
        ],
        "eps": 1e-8,
        "weight_decay": 0.1
    }
}

The key change is setting "torch_adam" to true.
Any ideas for regaining regular performance would be appreciated.
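
For anyone who wants to apply the same workaround without hand-editing the file, here is a minimal sketch in Python. It assumes the DeepSpeed config has already been loaded into a plain dict with an "optimizer" section like the one posted above; the helper name `use_torch_adam` is hypothetical and is not part of DeepSpeed or this repo.

```python
import copy
import json


def use_torch_adam(config: dict) -> dict:
    """Return a copy of a DeepSpeed config dict with "torch_adam" set to true.

    Hypothetical helper for illustration. Raises KeyError if the config
    has no "optimizer" section.
    """
    patched = copy.deepcopy(config)  # leave the caller's dict untouched
    params = patched["optimizer"].setdefault("params", {})
    params["torch_adam"] = True
    return patched


if __name__ == "__main__":
    # Optimizer section as posted in this issue, minus the torch_adam flag.
    config = {
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.00001,
                "betas": [0.9, 0.95],
                "eps": 1e-8,
                "weight_decay": 0.1,
            },
        }
    }
    print(json.dumps(use_torch_adam(config), indent=4))
```

The function works on a copy so the original settings stay available for comparing run times with and without the flag.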

Installing "cudatoolkit-dev" solved the issue without touching the config.
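
A quick way to see whether a CUDA toolchain is visible in the current environment is sketched below. It rests on an assumption that is not stated in this thread: that "cudatoolkit-dev" helps because it provides the nvcc compiler. The function name `find_cuda_toolchain` is hypothetical, and the check only inspects PATH and the CUDA_HOME environment variable.

```python
import os
import shutil


def find_cuda_toolchain() -> dict:
    """Report where nvcc and CUDA_HOME point, or None where nothing is found.

    Hypothetical helper: it only looks at PATH and the environment,
    and does not verify that the toolkit actually works.
    """
    return {
        "nvcc": shutil.which("nvcc"),  # None if nvcc is not on PATH
        "CUDA_HOME": os.environ.get("CUDA_HOME"),  # None if unset
    }


if __name__ == "__main__":
    for name, location in find_cuda_toolchain().items():
        print(f"{name}: {location or 'not found'}")
```

If both entries come back empty, that would be consistent with the toolkit being missing, though it does not prove that this is the cause of the lockups described above.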