p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch

Mean of expectation in SAC_discrete.py possibly wrong?

NikEyX opened this issue · 11 comments

Hi Petros,

in your discrete SAC code you are using the following in SAC_Discrete.py:

min_qf_next_target = action_probabilities * (torch.min(qf1_next_target, qf2_next_target) - self.alpha * log_action_probabilities)
min_qf_next_target = min_qf_next_target.mean(dim=1).unsqueeze(-1)

So it looks like you're taking the mean of the probability-weighted Q-values in the last line? However, the weights are already provided by action_probabilities, so instead of .mean(dim=1) it should be .sum(dim=1) in order to give the proper expectation of the next Q-value for each item in the batch. Or what am I missing here?
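
As a quick standalone check (the numbers are made up, just to illustrate the difference):

import torch

action_probabilities = torch.tensor([[0.7, 0.1, 0.1, 0.1]])  # one state, four actions
q_values = torch.tensor([[2.0, 1.0, 0.0, 1.0]])
weighted = action_probabilities * q_values

print(weighted.sum(dim=1))   # tensor([1.6000]) -> the actual expectation of Q under the policy
print(weighted.mean(dim=1))  # tensor([0.4000]) -> the same expectation divided by the 4 actions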

Running this on LunarLander, I also noticed that the automatic entropy adjustment doesn't seem to work well. The original SAC paper mentions there is no "upper bound" on it, since the value comes out of the constrained optimization problem. However, in your implementation the self.alpha value can go very high! I would have expected it to stay close to or below the initial target_entropy level, which in the case of LunarLander is -log(1/4) * 0.98 ≈ 1.36, but I've seen it go over 20! Do you have an explanation for that, by chance?
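
For context, the alpha update I'm referring to is the usual automatic entropy tuning from the SAC papers. A minimal standalone sketch of that mechanism (variable names and the placeholder policy outputs are mine, just to make it runnable; this is not the repo's exact code):

import torch

num_actions = 4
target_entropy = -0.98 * torch.log(torch.tensor(1.0 / num_actions))  # ~1.36 for |A| = 4

log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

# action_probabilities / log_action_probabilities would come from the current policy;
# random placeholders here just to make the snippet runnable
action_probabilities = torch.softmax(torch.randn(8, num_actions), dim=1)
log_action_probabilities = torch.log(action_probabilities + 1e-8)

inside_term = -log_alpha * (log_action_probabilities + target_entropy).detach()
alpha_loss = (action_probabilities.detach() * inside_term).sum(dim=1).mean()

alpha_optim.zero_grad()
alpha_loss.backward()
alpha_optim.step()
alpha = log_alpha.exp()

As long as the policy's entropy stays below target_entropy, this loss keeps pushing log_alpha (and hence alpha) upward, which would be consistent with the growth I'm seeing.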

Thanks for your work, it's very appreciated btw

So I fiddled around with this some more, recreated your implementation, and compared it to a Gumbel-Softmax implementation of SAC.

Unfortunately I think the implementation in your code is inaccurate, even though the math in the paper describes it correctly, namely as an expectation. The easiest way to verify this is that setting self.alpha to zero should recover the usual DQN-style quantities, e.g. the target should be the expected Q-value under the policy. Taking mean() on a probability-weighted tensor instead shrinks the Q-values/policy values by a factor equal to the number of actions, which cannot be what you want?
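
To spell that out, the quantity the code is trying to compute for each state in the batch is the soft state value used in the critic target, written here as an explicit sum over actions:

V(s_{t+1}) = \sum_{a} \pi(a \mid s_{t+1}) \left[ \min\!\big(Q_1(s_{t+1}, a),\, Q_2(s_{t+1}, a)\big) - \alpha \log \pi(a \mid s_{t+1}) \right]

With alpha = 0 this is exactly the expected Q-value under the policy (the sum of pi(a|s) * Q(s,a) over actions); a mean over the action dimension would instead divide it by the number of actions |A|.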

I believe the minimal change to your code is replacing .mean with .sum here:

min_qf_next_target = min_qf_next_target.mean(dim=1).unsqueeze(-1)

and here as well (the corresponding line in the policy loss):

Doing this will produce expected Q-values that are consistent with the rewards in the environment, so it's relatively easy to check.

I might be totally wrong about all of this, of course, so if anyone has ideas or comments, please let me know.

EDIT:
Previously I had a section here on a different approach to the target_entropy, but on reflection your alpha-loss implementation is somewhat different from mine, so I'm going to leave it out of this ticket. If anyone is interested in the future, I can happily paste my version.

I think it's probably better if someone else does that and verifies it independently. Thanks for the offer though :)

Hello,

I had to implement this exact algorithm for a school project. I can confirm that changing

min_qf_next_target = min_qf_next_target.mean(dim=1).unsqueeze(-1)

to

min_qf_next_target = min_qf_next_target.sum(dim=1).unsqueeze(-1)

as well as

policy_loss = policy_loss.mean()

to

policy_loss = policy_loss.sum(dim=1).mean()

fixed the whole algorithm on my side. Before the change, I observed the policy converging toward the uniform distribution. These two changes saved my day.
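
For anyone landing on this later, here is a minimal sketch of the two corrected computations as standalone functions. The tensor names follow the snippets quoted above; the function names and the reward/done handling are my own filler, so they may not match the repo exactly:

import torch

def soft_q_target(qf1_next_target, qf2_next_target,
                  action_probabilities, log_action_probabilities,
                  reward_batch, done_batch, alpha, gamma):
    # policy and target-critic outputs here are evaluated at the *next* states;
    # expectation over actions -> sum (not mean) along the action dimension
    qf_min = torch.min(qf1_next_target, qf2_next_target)
    v_next = (action_probabilities * (qf_min - alpha * log_action_probabilities)).sum(dim=1).unsqueeze(-1)
    return reward_batch + (1.0 - done_batch) * gamma * v_next

def policy_loss(qf1_pi, qf2_pi, action_probabilities, log_action_probabilities, alpha):
    # policy and critic outputs here are evaluated at the *current* states;
    # same idea: expectation over actions first, then average over the batch
    inside_term = alpha * log_action_probabilities - torch.min(qf1_pi, qf2_pi)
    return (action_probabilities * inside_term).sum(dim=1).mean()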

Thanks !

@NikEyX, could you paste your version of the self.alpha update, please? Actually, I am also at a loss about the initialization of alpha.

Hi.

I made pull request #60, which fixes the actor's and critic's losses.
I tested it using CartPole.py and saw that training converges more stably.

Thanks :)

@ku2482, @p-christ,
First, thanks for the PR made by @ku2482, but after testing with CartPole there is an issue I want to mention:
when training for more than 150 episodes the learned alpha becomes very large; personally, I've seen it reach 7000 over 500 episodes!
In addition, I found that target_entropy is important to the stability of alpha: if it is set to something like -0.1 or 0.2 it is OK, but if it is around 0.68, alpha diverges for this env.

I don't know what is wrong, since the issue doesn't occur with the continuous version.
At present I work around it by setting self.alpha = torch.clamp(self.alpha, min=0.0, max=1) after updating alpha.

I have observed the same behavior.
I assume the cause is that the target entropy (0.98 * maximum) is too large when |A| = 2, so the policy never reaches it and alpha keeps getting larger.
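
For concreteness (natural log, |A| = 2 for CartPole):

import math

max_entropy = math.log(2)     # ~0.693, entropy of the uniform policy over 2 actions
print(0.98 * max_entropy)     # ~0.68, the target reported above to make alpha diverge
print(0.7 * max_entropy)      # ~0.49, a lower target the policy can realistically stay above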

@p-christ
What do you think?

@p-christ
Thank you for your reply :)

@dbsxdbsx
I've tested CartPole-v0 with various target entropies before. At least for CartPole-v0, 0.8 or 0.7 * maximum seems fine.

[Screenshot: CartPole-v0 training results, 2020-10-01]

Thanks.

@p-christ @ku2482, thanks for your replies. Hopefully it is not an algorithm issue.