rail-berkeley/softlearning

Differences between the softlearning implementation of the alpha loss and formula 18 in the paper

Maggern3 opened this issue · 0 comments

Opening a new issue because the old one was closed without really explaining the differences in the softlearning implementation, and it has not been reopened on request (yet, after 1 week).


self.expected_entropy = -torch.prod(torch.tensor(action_space.shape).to(self.device)).item() # unsure if this is right for multidiscrete env
print('target entropy', self.expected_entropy) # gives target entropy -4
self.log_alpha = torch.tensor(0.0, requires_grad=True, device=self.device)
self.alpha = self.log_alpha.exp()  # initial alpha = exp(0.0) = 1.0; alternative tried: a fixed 0.2
self.alpha_optimizer = optim.Adam([self.log_alpha], lr=0.003)


# my impl based on formula 18 from paper, crashes
#alpha_loss = (-self.alpha * (log_prob - self.expected_entropy).detach()).mean()
# rail-berkeley/softlearning, crashes
#alpha_loss2 = -1.0 * (self.alpha * (log_prob + self.expected_entropy).detach()).mean()
# cyoon1729/Policy-Gradient-Methods, alpha is less than 0.0 in 80 episodes
#alpha_loss3 = (self.log_alpha * (-log_prob - self.expected_entropy).detach()).mean()    
# vitchyr/rlkit, alpha is less than 0.0 in 52 episodes
# p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch alpha is less than 0.0 in 50 episodes
alpha_loss4 = -(self.log_alpha * (next_log_prob_selected_actions + self.expected_entropy).detach()).mean()
self.alpha_optimizer.zero_grad()
alpha_loss4.backward()
self.alpha_optimizer.step()
self.alpha = self.log_alpha.exp()  # refresh alpha after the step so it reflects the updated log_alpha
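
To make the sign convention of the `log_alpha` variants easier to reason about, here is a minimal pure-Python sketch of a single update step. It is an illustration only: `alpha_step` and `sampled_log_probs` are hypothetical names, and plain gradient descent stands in for the Adam optimizer used above.

```python
def alpha_step(log_alpha, log_probs, target_entropy, lr=0.003):
    """One plain gradient-descent step on
    J(log_alpha) = -mean(log_alpha * (log_prob + target_entropy)).
    The log_probs are treated as constants, like the .detach() calls above."""
    mean_log_prob = sum(log_probs) / len(log_probs)
    grad = -(mean_log_prob + target_entropy)  # dJ / d(log_alpha)
    return log_alpha - lr * grad


# Hypothetical log-probabilities of sampled discrete actions (each is <= 0).
sampled_log_probs = [-0.5, -1.2, -2.0]

# The entropy estimate is -mean(log_prob), roughly 1.23 here.
low_target = alpha_step(0.0, sampled_log_probs, target_entropy=-4.0)
high_target = alpha_step(0.0, sampled_log_probs, target_entropy=3.0)

print(low_target)   # negative: entropy already exceeds the target, so log_alpha shrinks
print(high_target)  # positive: entropy is below the target, so log_alpha grows
```

With a negative target and discrete actions, `log_prob + target_entropy` is negative for every sample, so under this loss `log_alpha` can only decrease, whichever sampled actions are used.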

If you could explain the difference in the math between the paper (my implementation above) and softlearning, I'd appreciate it. Why are you using -1 as a multiplier? Why are log_prob and self.expected_entropy added, when in the paper the target entropy is subtracted?
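
For reference, this is formula 18 as I read it in the paper, with the common factor pulled out. The notation ($\bar{\mathcal{H}}$ for the target entropy, $\pi_t$ for the policy) follows the paper; the second equality is only algebra.

```latex
J(\alpha)
  = \mathbb{E}_{a_t \sim \pi_t}\!\left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right]
  = -\,\mathbb{E}_{a_t \sim \pi_t}\!\left[ \alpha \left( \log \pi_t(a_t \mid s_t) + \bar{\mathcal{H}} \right) \right]
```

Read this way, the leading `-1.0` and the sum `log_prob + self.expected_entropy` in the softlearning-style line correspond to the factored form on the right: both terms carry the same minus sign, so the log-probability and the target entropy are added inside the parentheses.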

In addition, what would be a good value for expected_entropy (the target entropy) for a MultiDiscrete([3 3 2 3]) action space, as in the Obstacle Tower environment?
Depending on how I calculate it, I get -4, -11 or -54, but I'm not sure which would be a good value. Judging from the other post you linked, -4 or -11 should work, but right now neither does. That could be due to the alpha loss function, though.
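
As a sanity check on these numbers, the sketch below shows one plausible way the three candidates arise and compares them with the entropy range a discrete policy can actually reach. Two things here are assumptions rather than facts from this thread: the mapping of -4, -11 and -54 to count, sum and product of `nvec`, and the 0.98 factor, a heuristic sometimes used in discrete-action SAC variants.

```python
import math

nvec = [3, 3, 2, 3]  # sub-action sizes of MultiDiscrete([3 3 2 3])

neg_num_dims = -len(nvec)              # -4: the "-dim(A)" rule of thumb for continuous actions
neg_sum_sizes = -sum(nvec)             # -11: minus the sum of the sub-action sizes
neg_joint_actions = -math.prod(nvec)   # -54: minus the number of joint actions

# The entropy of a discrete policy lies in [0, log(number of joint actions)].
max_entropy = sum(math.log(n) for n in nvec)  # log(54), about 3.99 nats

# Assumed heuristic: target a fraction of the maximum achievable entropy.
target_entropy = 0.98 * max_entropy

print(neg_num_dims, neg_sum_sizes, neg_joint_actions)
print(max_entropy, target_entropy)
```

Since the entropy of a discrete distribution is never negative, any of the three negative candidates is a constraint the policy always satisfies, which may be why alpha keeps shrinking in the runs above.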

If you need to see full source code it's here

Also, one question about the intuition behind alpha. If the entropy should be different for each state, shouldn't alpha get its own neural net? How can a single tensor capture different entropy values for different states? Is that achieved through how alpha is used together with the other losses? It doesn't make sense to me.
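
For context, this is the constrained problem behind the alpha loss as I understand it from the paper (notation follows the paper; treat the exact form as my reading, not a quotation):

```latex
\max_{\pi_{0:T}} \; \mathbb{E}_{\rho_\pi}\!\left[ \sum_{t=0}^{T} r(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ -\log \pi_t(a_t \mid s_t) \right] \ge \bar{\mathcal{H}}
\quad \forall t
```

The constraint is on the entropy averaged over the states the policy visits, not on each state separately, so its Lagrange multiplier $\alpha$ is a scalar. Per-state variation in entropy would then come from the policy network itself, which sees the state, while alpha only sets the overall trade-off.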

Much appreciated, thanks