ulissigroup/finetuna

The training process appears in nvidia-smi with GPU memory allocated, but the GPU does no actual work.


Hello, here is my code:

from finetuna.ml_potentials.finetuner_calc import FinetunerCalc

ml_potential = FinetunerCalc(
    checkpoint_path="gemnet_t_direct_h512_all.pt",
    mlp_params={
        "tuner": {
            "unfreeze_blocks": [
                "out_blocks.3.seq_forces",
                "out_blocks.3.scale_rbf_F",
                "out_blocks.3.dense_rbf_F",
                "out_blocks.3.out_forces",
                "out_blocks.2.seq_forces",
                "out_blocks.2.scale_rbf_F",
                "out_blocks.2.dense_rbf_F",
                "out_blocks.2.out_forces",
                "out_blocks.1.seq_forces",
                "out_blocks.1.scale_rbf_F",
                "out_blocks.1.dense_rbf_F",
                "out_blocks.1.out_forces",
            ],
            "num_threads": 32
        },
        "optim": {
            "batch_size": 1,
            "num_workers": 0,
            "max_epochs": 400,
            "lr_initial": 0.0003,
            "factor": 0.9,
            "eval_every": 1,
            "patience": 3,
            "checkpoint_every": 100000,
            "scheduler_loss": "train",
            "weight_decay": 0,
            "eps": 1e-8,
            "optimizer_params": {
                "weight_decay": 0,
                "eps": 1e-8,
            },
        },
        "task": {
            "primary_metric": "loss",
        },
        "local_rank": 0
    },
)
ml_potential.train(parent_dataset=train_dataset[:2])

My CUDA version is 11.3. nvidia-smi shows the training process and its GPU memory usage, but Volatile GPU-Util stays at 0% and the power draw does not increase. Is there a problem with my parameter settings?
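
As a quick sanity check, independent of finetuna, the snippet below confirms that this PyTorch build can reach the GPU at all and that tensors actually land on it (a minimal sketch; it verifies only the CUDA setup, not where finetuna places the model):

import torch

# Confirm the installed PyTorch build can talk to the CUDA driver.
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # name of the visible GPU

# A tensor created with device="cuda" should report a cuda device.
x = torch.randn(8, 8, device="cuda")
print(x.device)                       # expected: cuda:0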

@yinkaai maybe you can try adding "cpu": False in the mlp_params dict. (ref: update oal example for gpu usage #36)
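
For example (a sketch of the same call as above; only the added "cpu": False key is new, and the other sections are elided with ... for brevity):

ml_potential = FinetunerCalc(
    checkpoint_path="gemnet_t_direct_h512_all.pt",
    mlp_params={
        "cpu": False,  # tell the trainer to use the GPU instead of falling back to the CPU
        "tuner": {...},  # unchanged from the config above
        "optim": {...},
        "task": {...},
        "local_rank": 0,
    },
)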

thank you!