SamsungLabs/tqc_pytorch

failed to reproduce the results: the reward curves are flat

Closed this issue · 12 comments

Thanks for sharing the code, but I'm unable to reproduce the results.

I tested on Hopper-v2 and Walker2d-v2, and the reward curves of the algorithm are flat.

I'm curious whether there are bugs in the code. Could the authors look into it? Thank you!

Hello! I'll look into it.
Are the returns flat at zero, or are they the returns of a very bad agent?

We have benchmarked this version of the code, and it matched the TF version on the v3 versions of the environments.
As far as I remember, v2 and v3 have no functional difference for most of the environments.
I have tested Walker2d-v2 and Hopper-v2, and optimization clearly starts; the returns are not flat.

Can you specify your Python environment, MuJoCo version, and the exact code you run?

Emm... I'm using PyTorch 1.5.1 and Python 3.6.2; gym is 0.12.5 and mujoco_py is 2.0.2.2. The code I'm running is exactly this code.
I believe the Python environment is fine, because I have tested other PyTorch RL codebases and was able to reproduce their results.

We use MuJoCo 1.5 because the combination of Gym + MuJoCo 2.0 has an integration bug. Please try with MuJoCo 1.5. If there is still such a big difference, I will look into it.

Also, please post your learning curve(s) and the parameters you use.

The agent starts learning after switching to PyTorch 1.3 and MuJoCo 1.5, thanks for the help! I guess the issue was mainly due to the MuJoCo version difference.

FYI, an unrelated issue: the code throws an error with PyTorch 1.5.1. You may want to update it for compatibility.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 25]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
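For context, here is a minimal toy sketch of the pattern that seems to trigger this on PyTorch >= 1.5 (a hypothetical example, not this repo's code): a loss that was computed through the critic is backpropagated only after the critic optimizer has already modified the critic's weights in place.

import torch
import torch.nn as nn

# Hypothetical toy setup (not this repo's code): an "actor" and a "critic",
# where the actor loss is backpropagated through the critic.
actor = nn.Linear(3, 4)
critic = nn.Linear(4, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

state = torch.randn(8, 3)
action = torch.randn(8, 4)

critic_loss = critic(action).pow(2).mean()    # graph through the critic only
actor_loss = (-critic(actor(state))).mean()   # graph through actor AND critic

# Critic update first: step() modifies the critic's weights in place,
# bumping their autograd version counters.
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# actor_loss.backward() must differentiate through the critic, but the
# critic weights saved in its graph have since been modified in place.
# On PyTorch >= 1.5 this raises the RuntimeError quoted above.
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()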

Yes, you are right: with PyTorch 1.5.1 the script doesn't work at all.
I have tried MuJoCo 2.0 + Walker2d-v2, and the agents are clearly learning. I didn't train them to the end, but the returns aren't flat. So maybe the problem was somewhere else.

Yes, I am experiencing this when I run with updated versions of PyTorch and gym. Any possible solutions to overcome this issue? It happens when actor_loss.backward() is computed.

You need to change the order of critic and actor backward steps.

So, in trainer.train, do we do the following?

# --- Compute the target distribution ---
target = reward + not_done * self.discount * (sorted_z_part - alpha * next_log_pi)

# --- Critic loss ---
cur_z = self.critic(state, action)
critic_loss = quantile_huber_loss_f(cur_z, target)

# --- Policy and alpha loss ---
new_action, log_pi = self.actor(state)
alpha_loss = -self.log_alpha * (log_pi + self.target_entropy).detach().mean()
actor_loss = (alpha * log_pi - self.critic(state, new_action).mean(2).mean(1, keepdim=True)).mean()

# --- Update the actor first ---
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()

# --- Then update the critic and its target network ---
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
    target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

# --- Alpha update ---
self.alpha_optimizer.zero_grad()
alpha_loss.backward()
self.alpha_optimizer.step()

self.total_it += 1

Yes, actor first, then critic.
You can test it; the error (or warning) should go away.
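To spell out why the ordering matters, here is the same kind of hypothetical toy sketch as above (not the repo's code), with only the update order changed: actor_loss.backward() differentiates through the critic, so it has to run before the critic optimizer's in-place step().

import torch
import torch.nn as nn

# Same hypothetical toy setup as in the earlier sketch; only the order changes.
actor, critic = nn.Linear(3, 4), nn.Linear(4, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
state, action = torch.randn(8, 3), torch.randn(8, 4)

critic_loss = critic(action).pow(2).mean()
actor_loss = (-critic(actor(state))).mean()

# Actor first: this backward differentiates through the critic while its
# weights are still at the version saved during the forward pass.
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Critic second: zero_grad() also discards the gradients the actor backward
# left on the critic, and the in-place step() happens after all backward passes.
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()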

Perfect, it worked.