SamsungLabs/tqc_pytorch

failed to reproduce the results: the reward curves are flat

Closed this issue · 12 comments

Thanks for sharing the code, but I'm unable to reproduce the results.

I tested on Hopper-v2 and Walker2d-v2, and the reward curves of the algorithm are flat.

I'm curious whether there are bugs in the code. Could the authors look into it? Thank you!

Hello! I'll look into it.
Are the returns flat at zero, or are they the returns of a very bad agent?

We have benchmarked this version of the code, and it matched the TF version on the v3 versions of the environments.
As far as I remember, v2 and v3 have no functional difference for most of the environments.
I have tested Walker2d-v2 and Hopper-v2, and optimization clearly starts; the returns are not flat.

Can you specify your Python environment, MuJoCo version, and the exact code you run?

Emm... I'm using PyTorch 1.5.1 and Python 3.6.2; gym is 0.12.5 and mujoco_py is 2.0.2.2. The code I'm running is exactly this code.
I believe the Python environment is fine, because I have tested other PyTorch RL codebases and was able to reproduce their results.

We use MuJoCo 1.5 because the combination of Gym + MuJoCo 2.0 has an integration bug. Please try with MuJoCo 1.5. If there is still such a big difference, I will look into it.

Also, please post your learning curve(s) and the parameters you use.

The agent starts learning after switching to PyTorch 1.3 and MuJoCo 1.5, thanks for the help! I guess the issue was mainly due to the MuJoCo version difference.

FYI, an unrelated issue: the code throws an error with PyTorch 1.5.1. You may want to update it for compatibility.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 25]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
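For context, here is a minimal toy sketch of the pattern that seems to trigger this on PyTorch >= 1.5 (a hypothetical example, not this repo's code): a loss that was computed through the critic is backpropagated only after the critic optimizer has already modified the critic's weights in place.

import torch
import torch.nn as nn

# Hypothetical toy setup (not this repo's code): an "actor" and a "critic",
# where the actor loss is backpropagated through the critic.
actor = nn.Linear(3, 4)
critic = nn.Linear(4, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

state = torch.randn(8, 3)
action = torch.randn(8, 4)

critic_loss = critic(action).pow(2).mean()    # graph through the critic only
actor_loss = (-critic(actor(state))).mean()   # graph through actor AND critic

# Critic update first: step() modifies the critic's weights in place,
# bumping their autograd version counters.
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# actor_loss.backward() must differentiate through the critic, but the
# critic weights saved in its graph have since been modified in place.
# On PyTorch >= 1.5 this raises the RuntimeError quoted above.
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()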

Yes, you are right: with PyTorch 1.5.1 the script doesn't work at all.
I have tried MuJoCo 2.0 + Walker2d-v2, and the agents are clearly learning. I didn't train them to the end, but the returns aren't flat. So maybe the problem was somewhere else.

Yes, I am experiencing this when I run with updated versions of PyTorch and gym. Any possible solutions to overcome this issue? It happens when actor_loss.backward() is computed.

You need to change the order of critic and actor backward steps.

So, in trainer.train, do we do the following?

# --- Compute the target distribution ---
target = reward + not_done * self.discount * (sorted_z_part - alpha * next_log_pi)

# --- Critic loss ---
cur_z = self.critic(state, action)
critic_loss = quantile_huber_loss_f(cur_z, target)

# --- Policy and alpha loss ---
new_action, log_pi = self.actor(state)
alpha_loss = -self.log_alpha * (log_pi + self.target_entropy).detach().mean()
actor_loss = (alpha * log_pi - self.critic(state, new_action).mean(2).mean(1, keepdim=True)).mean()

# --- Update the actor first ---
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()

# --- Then update the critic and its target network ---
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
    target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

# --- Alpha update ---
self.alpha_optimizer.zero_grad()
alpha_loss.backward()
self.alpha_optimizer.step()

self.total_it += 1

Yes, actor first, then critic.
You can test it; the error (or warning) should go away.
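To spell out why the ordering matters, here is the same kind of hypothetical toy sketch as above (not the repo's code), with only the update order changed: actor_loss.backward() differentiates through the critic, so it has to run before the critic optimizer's in-place step().

import torch
import torch.nn as nn

# Same hypothetical toy setup as in the earlier sketch; only the order changes.
actor, critic = nn.Linear(3, 4), nn.Linear(4, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
state, action = torch.randn(8, 3), torch.randn(8, 4)

critic_loss = critic(action).pow(2).mean()
actor_loss = (-critic(actor(state))).mean()

# Actor first: this backward differentiates through the critic while its
# weights are still at the version saved during the forward pass.
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Critic second: zero_grad() also discards the gradients the actor backward
# left on the critic, and the in-place step() happens after all backward passes.
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()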

Perfect, it worked.