redundant max in double dqn
DongukJu opened this issue · 4 comments
In Double DQN, I found that there is max Q(~~~, argmax Q(~~~)).
Do we need the max even though we already have the argmax inside Q?
I think the max is redundant.
Would you kindly check this to reduce confusion?
Hi. @DongukJu
As you said, equations 1 and 2 mean the same thing. But equation 2 is a variation of the expression, written to make the transition to double Q-learning clear. If you look at equation 3 closely, the two Q-values use two different parameter sets (theta, theta'). Double Q-learning is a way to reduce over-estimation by having the two Q-functions update each other. To explain the process of the change, therefore, it is better to keep the expression as it is.
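For reference, here are the two targets under discussion, written in the standard Double DQN notation (I am assuming this matches the equations in the post; theta is the online network, theta^- the target network):

```latex
% Vanilla DQN target: one network both selects and evaluates the action.
Y^{\mathrm{DQN}} = r + \gamma \max_{a} Q(s', a; \theta^{-})

% Double DQN target: the online network (theta) selects the action,
% the target network (theta^-) evaluates it. Note there is no outer max.
Y^{\mathrm{Double}} = r + \gamma \, Q\!\bigl(s',\, \operatorname*{argmax}_{a} Q(s', a; \theta);\, \theta^{-}\bigr)
```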
Thank you!
Dear @MrSyee,
Thanks for your reply.
As you mentioned, I agree with the importance of the variation.
The max_a in eq 2 and eq 3 is redundant.
As far as I understand, the max is doing nothing there.
If you want to convey the same intuition to the readers, we might allow this redundancy, but it may still cause confusion.
The max is supposed to do something, but it does nothing: Q(s', argmax_a Q(s', a)) is already a single value per state.
def _compute_dqn_loss(self, samples: Dict[str, np.ndarray]) -> torch.Tensor:
    """Return dqn loss."""
    device = self.device  # for shortening the following lines
    state = torch.FloatTensor(samples["obs"]).to(device)
    next_state = torch.FloatTensor(samples["next_obs"]).to(device)
    action = torch.LongTensor(samples["acts"].reshape(-1, 1)).to(device)
    reward = torch.FloatTensor(samples["rews"].reshape(-1, 1)).to(device)
    done = torch.FloatTensor(samples["done"].reshape(-1, 1)).to(device)

    # G_t = r + gamma * v(s_{t+1})   if state != Terminal
    #     = r                        otherwise
    curr_q_value = self.dqn(state).gather(1, action)
    next_q_value = self.dqn_target(next_state).gather(  # Double DQN
        1, self.dqn(next_state).argmax(dim=1, keepdim=True)
    ).detach()
    mask = 1 - done
    target = (reward + self.gamma * next_q_value * mask).to(self.device)

    # calculate dqn loss
    loss = F.smooth_l1_loss(curr_q_value, target)
    return loss
Again, in your code there is only one argmax for next_q_value, not a max plus an argmax.
Would you kindly clarify this?
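To make the redundancy concrete, here is a minimal NumPy sketch (the Q-value arrays are made up for illustration, not taken from the repo). With a single set of Q-values, evaluating Q at its own argmax already yields the max, so wrapping that in another max changes nothing; the interesting case is when a second network does the evaluation, as in Double DQN.

```python
import numpy as np

rng = np.random.default_rng(0)
q_online = rng.normal(size=(4, 3))  # hypothetical online-net Q-values: 4 states, 3 actions
q_target = rng.normal(size=(4, 3))  # hypothetical target-net Q-values

# Double DQN: select actions with the online net, evaluate them with the target net.
greedy = q_online.argmax(axis=1)
double_q = q_target[np.arange(4), greedy]

# With one network, Q(s', argmax_a Q(s', a)) is already the max for each state;
# an extra max over this single value per state would be a no-op.
same_net = q_online[np.arange(4), greedy]
assert np.allclose(same_net, q_online.max(axis=1))
```

In other words, the outer max in eq 2 and eq 3 only has meaning if the two Q's inside and outside the argmax were different, which is exactly the Double DQN case, and then it should be written without the max, as in the code above.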
Dear @DongukJu
Oh, you're right. I'm sorry that I didn't catch the typo even after reading your comment.
I'll fix it.
Thanks for your insight.
The typo is fixed. Thanks.