opendilab/LightZero

How to configure multiple GPUs to train a model?

valkryhx opened this issue · 19 comments

I have 2 GPUs on one node, so how should I set the config to use both GPUs for training?
For now I find that only one card is in use.
Thanks.

Hello, our code repository supports multi-GPU training via the Distributed Data Parallel (DDP) functionality provided by PyTorch. You can view and test the configuration file locally via this link: atari_muzero_multigpu_ddp_config.py. Should you encounter any questions or issues during testing, please feel free to reach out to us. Best wishes!

I followed the link above; here is the error info.
There are 2 GPUs per node, so the ranks are 0 and 1:
get_rank()=0
get_rank()=1

With this line in lzero/entry/train_muzero.py:

tb_logger = SummaryWriter(os.path.join('./{}/log/'.format(cfg.exp_name), 'serial')) if get_rank() == 0 else None

tb_logger is None for the process with rank=1, and I get the following error because the None tb_logger has no attribute 'add_scalar':

Traceback (most recent call last):
File "/kaggle/working/LightZero/./zoo/classic_control/grid/config/mygrid_efficientzero_ddp_config.py", line 169, in
train_muzero([main_config, create_config], seed=0, max_env_step=max_env_step)
File "/kaggle/working/LightZero/lzero/entry/train_muzero.py", line 138, in train_muzero
log_buffer_memory_usage(learner.train_iter, replay_buffer, tb_logger)
File "/kaggle/working/LightZero/lzero/entry/utils.py", line 53, in log_buffer_memory_usage
writer.add_scalar('Buffer/num_of_all_collected_episodes', buffer.num_of_collected_episodes, train_iter)
AttributeError: 'NoneType' object has no attribute 'add_scalar'
Exception ignored in: <function MuZeroCollector.__del__ at 0x7bb90184fb50>

But when I comment out the rank condition like this
tb_logger = SummaryWriter(os.path.join('./{}/log/'.format(cfg.exp_name), 'serial'))  # if get_rank() == 0 else None

in lzero/entry/train_muzero.py, another error arises, which also seems to be a NoneType error.

Traceback (most recent call last):
File "/kaggle/working/LightZero/./zoo/classic_control/grid/config/mygrid_efficientzero_ddp_config.py", line 169, in
train_muzero([main_config, create_config], seed=0, max_env_step=max_env_step)
File "/kaggle/working/LightZero/lzero/entry/train_muzero.py", line 197, in train_muzero
log_vars = learner.train(train_data, collector.envstep)
File "/opt/conda/lib/python3.10/site-packages/ding/worker/learner/base_learner.py", line 165, in wrapper
ret = fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/ding/worker/learner/base_learner.py", line 205, in train
log_vars = self._policy.forward(data, **policy_kwargs)
File "/kaggle/working/LightZero/lzero/policy/efficientzero.py", line 476, in _forward_learn
self.sync_gradients(self._learn_model)
File "/opt/conda/lib/python3.10/site-packages/ding/policy/base_policy.py", line 427, in sync_gradients
allreduce(param.grad.data)
AttributeError: 'NoneType' object has no attribute 'data'

Here is the Kaggle notebook you can refer to: kaggle

Hello, is your mygrid_efficientzero_ddp_config.py file a modification of the atari_efficientzero_multigpu_ddp_config.py file? Are the relevant settings consistent with it?


I modified the Atari DDP config file a bit to create mygrid_efficientzero_ddp_config.py, and in the same directory I also created a mygrid_efficientzero_config.py which runs normally on 1 GPU. Both files differ a bit from the original repo, but only the DDP config fails to run.
In the DDP config, what I changed is:
gpu_num = 2
n_episode = int(8*gpu_num)
multi_gpu=True
and the training code is wrapped in DDPContext(), roughly as in the sketch below.
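A minimal sketch of how those pieces usually fit together, modeled on the Atari multi-GPU example config; the exact variable names, the elided parts, and the launch command in the comment are assumptions and may differ in your local files:

gpu_num = 2
n_episode = int(8 * gpu_num)  # scale the number of collected episodes with the GPU count

# Inside the policy section of the config:
#     multi_gpu=True,      # enable gradient synchronization across processes
#     n_episode=n_episode,

if __name__ == "__main__":
    # Launch with the standard PyTorch DDP launcher, for example:
    #     python -m torch.distributed.launch --nproc_per_node=2 ./path/to/mygrid_efficientzero_ddp_config.py
    from ding.utils import DDPContext
    from lzero.entry import train_muzero
    # main_config and create_config are built earlier in the config file.
    with DDPContext():
        train_muzero([main_config, create_config], seed=0, max_env_step=max_env_step)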

  • Based on the error information and code you provided, the issue primarily arises because, during multi-GPU training, only the process with rank 0 creates the TensorBoard SummaryWriter object, while tb_logger for the other processes is None. This leads to an AttributeError when those processes attempt to use tb_logger. To resolve this, check whether tb_logger is None before using it, and skip the related logging operations if it is. You can modify the code as follows:
  • Add a check for whether writer is None in the function log_buffer_memory_usage within utils.py:
def log_buffer_memory_usage(train_iter, buffer, writer):
    if writer is not None:
        writer.add_scalar('Buffer/num_of_all_collected_episodes', buffer.num_of_collected_episodes, train_iter)
        ...
  • With the aforementioned modifications, you should be able to resolve the errors caused by tb_logger being None. In this manner, only the process with rank 0 will record TensorBoard logs, while other processes will bypass the logging portion.

  • Hello, please test it with the above modifications. If any problems remain, we will address them. Due to current resource constraints, we will test and fully fix these issues in the main branch later.

Thanks for your reply, I am testing it.
Besides, during training the model ckpt is saved every 10^4 iterations, which is too infrequent for my case.
How do I set the save_ckpt_freq in the config file? I can only find eval_freq.


After applying the modification above, a new error arises:

Traceback (most recent call last):
File "/kaggle/working/LightZero/./zoo/classic_control/grid/config/mygrid_efficientzero_ddp_config.py", line 169, in
train_muzero([main_config, create_config], seed=0, max_env_step=max_env_step)
File "/kaggle/working/LightZero/lzero/entry/train_muzero.py", line 197, in train_muzero
log_vars = learner.train(train_data, collector.envstep)
File "/opt/conda/lib/python3.10/site-packages/ding/worker/learner/base_learner.py", line 165, in wrapper
ret = fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/ding/worker/learner/base_learner.py", line 205, in train
log_vars = self._policy.forward(data, **policy_kwargs)
File "/kaggle/working/LightZero/lzero/policy/efficientzero.py", line 476, in _forward_learn
self.sync_gradients(self._learn_model)
File "/opt/conda/lib/python3.10/site-packages/ding/policy/base_policy.py", line 427, in sync_gradients
allreduce(param.grad.data)
AttributeError: 'NoneType' object has no attribute 'data'
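For reference, param.grad is None for parameters that received no gradient in the backward pass (for example, unused model parameters), so allreduce(param.grad.data) fails on them. A minimal illustrative guard, purely a sketch and not the maintainers' fix (that lands in the pull request referenced later in this thread), would be:

import torch.distributed as dist

def sync_gradients_skip_none(model) -> None:
    # Sketch only: all-reduce gradients across processes while skipping parameters
    # whose .grad is None (e.g. parameters not used in the current forward pass).
    # Whether and where gradients are averaged depends on the framework's convention.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data)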

save_ckpt_freq

You can control the frequency at which checkpoints (ckpt) are saved by adjusting the save_ckpt_after_iter parameter within your configuration file. To do so, include the learn dictionary within your configuration, as demonstrated below:

config = {
    ...
    'eval_freq': int(1e2),
    'learn': {
        'learner': {
            'hook': {
                'log_show_after_iter': 100,
                'save_ckpt_after_iter': 100,   # Set this to your desired frequency
                'save_ckpt_after_run': True,
            },
        },
    },
}

By setting the save_ckpt_after_iter parameter to the desired number of iterations, you dictate how often a checkpoint should be saved. In the example above, the system is configured to save a checkpoint after every 100 iterations. Adjust this value according to your requirements and resource constraints.

AttributeError: 'NoneType' object has no attribute 'data'

Hi, we will get back to you later as we are currently facing limitations in accessing GPU resources.


I modified the config with the learn.learner.hook settings above, but it does not seem to take effect.

but it does not seem to take effect.

Hello, it should be something like this: mygrid_efficientzero_config.policy.update(save_freq_dict).
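For illustration, a rough sketch of that suggestion; save_freq_dict is the name used in the reply above, and the nesting of the hook keys follows the earlier config snippet, so adjust it to your actual config structure:

save_freq_dict = dict(
    learn=dict(
        learner=dict(
            hook=dict(
                log_show_after_iter=100,
                save_ckpt_after_iter=100,  # save a checkpoint every 100 training iterations
                save_ckpt_after_run=True,
            ),
        ),
    ),
)
# Merge the hook settings into the policy config before calling train_muzero.
mygrid_efficientzero_config.policy.update(save_freq_dict)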

Hello,

Thank you for your patience. We have addressed the training issues with atari_efficientzero_multigpu_ddp_config.py in this pull request: #200. You are invited to conduct a local test to verify the fixes.

Best wishes.

Thank you for your detailed reply and help.
I will test the new code.

'save_ckpt_after_iter': 100,

It works after I modify the policy config as you instructed.
Another question: when I set 'save_ckpt_after_iter': 100, many checkpoints are saved,
like iteration_100.ckpt, iteration_200.ckpt, iteration_10000.ckpt, etc.
How do I configure it to keep only a limited number of the latest checkpoints, like the latest 5?


The fixes in #200 really work; I tested the fixed DDP code and it runs well. I also noticed that when I evaluate a trained model ckpt, a temporary directory containing a model ckpt is created, like mygrid_efficientzero_ns40_upc50_rr0_seed0_240320_040558. Does the evaluation create this ckpt? How can I avoid saving it?

How do I configure it to keep only a limited number of the latest checkpoints, like the latest 5?

Hello, we haven't implemented this feature yet, but it is indeed useful for saving space. If you're interested, you can submit a pull request to modify the storage logic in this file. For your local environment, I suggest using a script to implement this feature, for example as sketched below.
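A minimal local cleanup sketch, not part of LightZero; the checkpoint directory and the iteration_*.ckpt naming pattern are assumptions based on the filenames mentioned above:

import glob
import os
import re

def prune_checkpoints(ckpt_dir: str, keep_latest: int = 5) -> None:
    # Keep only the `keep_latest` newest iteration_*.ckpt files in ckpt_dir.
    pattern = re.compile(r'iteration_(\d+)\.ckpt$')
    ckpts = []
    for path in glob.glob(os.path.join(ckpt_dir, 'iteration_*.ckpt')):
        match = pattern.search(os.path.basename(path))
        if match:
            ckpts.append((int(match.group(1)), path))
    # Sort by iteration number (oldest first) and delete all but the newest files.
    ckpts.sort(key=lambda item: item[0])
    for _, path in ckpts[:max(len(ckpts) - keep_latest, 0)]:
        os.remove(path)

# Example usage (directory name is hypothetical):
# prune_checkpoints('./mygrid_efficientzero_ddp_config_seed0/ckpt', keep_latest=5)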

How can I avoid saving it?

Hello, this file is indeed generated by the evaluation script and it records many key details. Specifically, it is generated at this location: https://github.com/opendilab/DI-engine/blob/main/ding/config/config.py#L465. During the evaluation process, statistical information is written into this file. We recommend retaining the file for subsequent reviews and checks. If you prefer not to keep the file, it currently can only be deleted manually.


Thank you for your detailed help.