Question about exploration in action selection

I compared the code of pymarl and pymarl2.

I found that the exploration step of action selection in pymarl's basic_controller.py:
https://github.com/oxwhirl/pymarl/blob/c971afdceb34635d31b778021b0ef90d7af51e86/src/controllers/basic_controller.py#L40-L48

if not test_mode:
    # Epsilon floor
    epsilon_action_num = agent_outs.size(-1)
    if getattr(self.args, "mask_before_softmax", True):
        # With probability epsilon, we will pick an available action uniformly
        epsilon_action_num = reshaped_avail_actions.sum(dim=1, keepdim=True).float()

    agent_outs = ((1 - self.action_selector.epsilon) * agent_outs
                   + th.ones_like(agent_outs) * self.action_selector.epsilon/epsilon_action_num)

has been moved into action_selectors.py:

pymarl2/src/components/action_selectors.py

Lines 94 to 97 in d0aaf58

    
epsilon_action_num = (avail_actions.sum(-1, keepdim=True) + 1e-8)
masked_policies = ((1 - self.epsilon) * masked_policies
                   + avail_actions * self.epsilon / epsilon_action_num)
masked_policies[avail_actions == 0] = 0

The computation also seems to differ in how the mask is applied. May I ask why it was changed this way?

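They are basically the same. The original was just too messy, so exploration is now handled in one place, in the selectors. It also seems to be more stable when tested with VMIX.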
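To illustrate why the two formulas can give the same result, here is a minimal plain-Python sketch (lists instead of tensors, so it runs without PyTorch). The function names are hypothetical. It assumes that unavailable actions already have probability 0 after the masked softmax, and that both versions zero out unavailable actions after mixing; the pymarl2 snippet does this explicitly, while for pymarl it is an assumption made here.

```python
def epsilon_floor_ones(probs, avail, eps):
    """pymarl-style: add eps / n_avail to every action, then zero unavailable ones."""
    n_avail = sum(avail)
    mixed = [(1 - eps) * p + eps / n_avail for p in probs]
    # Assumed follow-up step: unavailable actions are zeroed after mixing.
    return [m if a else 0.0 for m, a in zip(mixed, avail)]


def epsilon_floor_masked(probs, avail, eps):
    """pymarl2-style: add eps / n_avail only to available actions."""
    n_avail = sum(avail) + 1e-8  # small constant guards against division by zero
    mixed = [(1 - eps) * p + a * eps / n_avail for p, a in zip(probs, avail)]
    return [m if a else 0.0 for m, a in zip(mixed, avail)]


# Four actions, only the first two are available.
probs = [0.7, 0.3, 0.0, 0.0]
avail = [1, 1, 0, 0]

ones_style = epsilon_floor_ones(probs, avail, eps=0.1)
masked_style = epsilon_floor_masked(probs, avail, eps=0.1)
print(ones_style)
print(masked_style)
```

Under these assumptions the two outputs agree up to the 1e-8 term, and the probability mass stays on the available actions.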

OK. One more small question: I noticed that in episode_runner.py you changed the code so that the actions are moved to the CPU before being written into the batch via update:

pymarl2/src/runners/episode_runner.py

Lines 68 to 80 in d0aaf58

    
# Fix memory leak
cpu_actions = actions.to("cpu").numpy()
reward, terminated, env_info = self.env.step(actions[0])
episode_return += reward
post_transition_data = {
    "actions": cpu_actions,
    "reward": [(reward,)],
    "terminated": [(terminated != env_info.get("episode_limit", False),)],
}
self.batch.update(post_transition_data, ts=self.t)

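However, the batch's update function has a line that puts the values on the GPU (if a GPU is in use). So isn't moving them from the GPU back to the CPU beforehand somewhat redundant, or was there some other reason for it?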

pymarl2/src/components/episode_buffer.py

Line 103 in d0aaf58

v = th.tensor(v, dtype=dtype, device=self.device)

Because the buffer's data is kept in CPU memory, so the actions are moved over to free GPU memory.

I stepped through the code in a debugger. self.batch is defined here via self.new_batch():

pymarl2/src/runners/episode_runner.py

Line 44 in d0aaf58

self.batch = self.new_batch()

and self.new_batch() is initialized with self.args.device, so if the current device is cuda, the batch will use cuda:

pymarl2/src/runners/episode_runner.py

Line 30 in 3f894bc

    
self.new_batch = partial(EpisodeBatch, scheme, groups, self.batch_size, self.episode_limit + 1,

As a result, self.device in this step is cuda:

pymarl2/src/components/episode_buffer.py

Line 103 in d0aaf58

v = th.tensor(v, dtype=dtype, device=self.device)

Then I looked at how the buffer is defined, and it stores its data on the CPU (when args.buffer_cpu_only is set):

pymarl2/src/run/run.py

Lines 113 to 115 in d0aaf58

    
buffer = ReplayBuffer(scheme, groups, args.buffer_size, env_info["episode_limit"] + 1,
                      preprocess=preprocess,
                      device="cpu" if args.buffer_cpu_only else args.device)

So the logic here should be: first obtain episode_batch (on the GPU), then add it to the buffer (the values are moved to the CPU when they are inserted into the buffer):

pymarl2/src/run/run.py

Lines 177 to 178 in d0aaf58

    
episode_batch = runner.run(test_mode=False)
buffer.insert_episode_batch(episode_batch)

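So while episode_batch is being built, the actions go from the GPU to the CPU and are then moved back to the GPU, which seems somewhat redundant, since everything ends up on the CPU anyway when it is added to the buffer. I'm not sure whether my understanding is correct.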

You've looked at this very carefully. Given that, new_batch should use the CPU.
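The round trip discussed above can be traced with a toy model. This is only a sketch of the reasoning, not pymarl2 code: trace_devices and count_transfers are hypothetical helpers, and the device names are plain strings rather than real tensors.

```python
def trace_devices(action_device, batch_device, buffer_device):
    """Return the devices one action value passes through during a step."""
    return [
        action_device,  # output of action selection
        "cpu",          # actions.to("cpu").numpy()
        batch_device,   # th.tensor(v, device=self.device) inside batch.update()
        buffer_device,  # buffer.insert_episode_batch(episode_batch)
    ]


def count_transfers(path):
    """Count how many consecutive hops change the device."""
    return sum(1 for src, dst in zip(path, path[1:]) if src != dst)


# Batch created on cuda, buffer on cpu: GPU -> CPU -> GPU -> CPU.
gpu_batch = count_transfers(trace_devices("cuda", "cuda", "cpu"))

# Batch created on cpu, buffer on cpu: GPU -> CPU only.
cpu_batch = count_transfers(trace_devices("cuda", "cpu", "cpu"))

print(gpu_batch, cpu_batch)
```

In this model, creating the batch on the CPU removes the two extra device hops, which matches the conclusion that new_batch should use the CPU when the buffer is stored on the CPU.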

Makes sense, thanks!
