rickstaa/stable-learning-control

Validate LAC/SAC pytorch translation

Closed this issue · 8 comments

User story

In order to ship the LAC/SAC PyTorch implementation to the team, we need to validate whether it gives the same results as the LAC/SAC TensorFlow version.

Considerations

Validate SAC (LAC use_lyapunov=False)

  • Run both Spinning Up SAC and LAC (use_lyapunov=False) to see if they give the same results.
  • In the meantime, quickly compare both codebases to see if they are fully compatible.

Validate LAC

  • Run the LAC of Han et al. 2019 and the new LAC to see if they give the same results.
  • In the meantime, quickly compare both codebases to see if they are fully compatible.

Acceptance criteria

  • Both codes give the same results when started with similar parameters.

Validate SAC (LAC use_lyapunov=False)

The results appear to be equal. We can therefore safely assume that the Spinning Up SAC implementation is equivalent to the LAC implementation with use_lyapunov disabled.

Learning parameters

Inference

image

Spinning up

image

LAC (use_lyapunov disabled)

image

Validate LAC

It appears that the LAC PyTorch implementation has a higher offset than the SAC and LAC TensorFlow implementations:

The performance becomes worse after more training steps:

image

Differences between the PyTorch implementation and Minghao's implementation

  1. Minghao uses log_alpha in the alpha_loss formula (see L116). I use alpha since this is in line with how Haarnoja et al. 2019 (see L254) perform automatic temperature tuning. I don't think this should matter much, since both are increasing functions of alpha, but maybe there is a good reason Minghao uses log_alpha.

  2. In the loss_lambda formula, Minghao also uses log_lambda where I would expect him to use lambda (see L115).
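To make the first point concrete, here is a minimal sketch comparing the two temperature losses. It assumes both variants optimize log_alpha as the trainable parameter and use the usual form `-mean(x * (log_pi + target_entropy))`, with `x` being either log_alpha or alpha. The function names and the sample values are hypothetical and are not taken from either codebase.

```python
import math


def alpha_loss_log_form(log_alpha, log_pis, target_entropy):
    """Temperature loss that multiplies by log_alpha (assumed TensorFlow-style form)."""
    return -sum(log_alpha * (lp + target_entropy) for lp in log_pis) / len(log_pis)


def alpha_loss_plain_form(log_alpha, log_pis, target_entropy):
    """Temperature loss that multiplies by alpha = exp(log_alpha)."""
    alpha = math.exp(log_alpha)
    return -sum(alpha * (lp + target_entropy) for lp in log_pis) / len(log_pis)


def grad_wrt_log_alpha(loss_fn, log_alpha, log_pis, target_entropy, eps=1e-6):
    """Central finite-difference gradient of the loss with respect to log_alpha."""
    up = loss_fn(log_alpha + eps, log_pis, target_entropy)
    down = loss_fn(log_alpha - eps, log_pis, target_entropy)
    return (up - down) / (2.0 * eps)


# Made-up batch of log-probabilities and a made-up entropy target.
log_pis = [-1.2, -0.8, -1.5]
target_entropy = -2.0
log_alpha = math.log(0.5)

g_log = grad_wrt_log_alpha(alpha_loss_log_form, log_alpha, log_pis, target_entropy)
g_plain = grad_wrt_log_alpha(alpha_loss_plain_form, log_alpha, log_pis, target_entropy)
print(g_log, g_plain)
```

Under these assumptions the two gradients always share the same sign and differ only by the positive factor alpha, so both losses push the temperature in the same direction but with a different effective step size. That scaling could still change the training dynamics, which is why it is worth keeping on the list of suspects.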

Since we have not yet found what causes the difference between the two implementations, I will do the following:

  • Modify the logger such that it displays the variables in the same way as Han et al. 2019.
  • Double-check the used hyperparameters.
  • Compare the PyTorch code with the article again to see if there is something that is unclear.
  • Compare the PyTorch code with the TensorFlow code again to see if I missed something.
    • Double-check the action clamping.
    • Double-check whether use_lyapunov disables the right parts.
  • Add Tensorboard logging to TensorFlow version so we can visually compare the two implementations.
  • Print TensorFlow network graph to Tensorboard so we can compare it with the graph in the PyTorch implementation.

After that I can also:

  • Check if there is a problem with the inference script. This can be done by adding robustnes_eval.py to the PyTorch implementation.
  • Set the random seeds and initial conditions equal in both scripts, and use eager execution to compare the exact outputs of all steps between the PyTorch and TensorFlow implementations.
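The step-by-step comparison in the last bullet could look roughly like the sketch below. It assumes each implementation can dump its per-step diagnostics as a list of dictionaries. The helper name `first_divergence`, the logged keys, and the numbers are all hypothetical and only illustrate the idea.

```python
import math


def first_divergence(trace_a, trace_b, rel_tol=1e-6, abs_tol=1e-8):
    """Return (step, name, value_a, value_b) for the first logged value that differs.

    Only keys present in both records are compared. Returns None when the
    traces agree within the given tolerances.
    """
    for step, (rec_a, rec_b) in enumerate(zip(trace_a, trace_b)):
        for name in sorted(rec_a.keys() & rec_b.keys()):
            if not math.isclose(rec_a[name], rec_b[name], rel_tol=rel_tol, abs_tol=abs_tol):
                return step, name, rec_a[name], rec_b[name]
    return None


# Made-up diagnostics, one dictionary per update step.
pytorch_trace = [{"loss_q": 0.50, "alpha": 1.00}, {"loss_q": 0.42, "alpha": 0.99}]
tensorflow_trace = [{"loss_q": 0.50, "alpha": 1.00}, {"loss_q": 0.42, "alpha": 0.97}]

result = first_divergence(pytorch_trace, tensorflow_trace)
print(result)
```

With equal seeds and initial conditions, the first step at which a value diverges narrows down which part of the update to inspect.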

Double-check hyperparameters

Hyperparameter translation Han vs Mine

  • max_global_steps: steps_per_epoch * epochs
  • num_of_trials: The number of random seeds (agents) you train - Does not exist in my implementation; I only train one agent for steps_per_epoch * epochs steps.
  • start_of_trail: The start index of the folders in which the random seeds are saved. Example (start_of_trail=4): ./LAC20200827_0046/4/, ./LAC20200827_0046/5/, etc. - Does not exist in my implementation.
  • num_of_evaluation_paths: The number of rollouts/trajectories that are used in the test run - In my implementation num_test_episodes.
  • num_of_training_paths: Does not exist in the PyTorch implementation. In addition to the ReplayBuffer, Han also stores the trajectories of the rollouts. It is these trajectories that are used during the performance evaluation (The values that are printed to the)
  • steps_per_cycle: The number of steps taken before performing the SGD update - In my implementation update_every.
  • train_per_cycle: The number of SGD passes to perform in every update cycle - Doesn't exist in my implementation. I simply locked the ratio of env steps to gradient steps to 1, meaning that after update_every steps the SGD runs update_every times. In Minghao's code, after steps_per_cycle steps the SGD is performed train_per_cycle times.
    • Added this to my implementation to see if it has any effect.
  • ⚠️ evaluation_frequency: How often the end-of-epoch handling runs (save model, test performance, and log data) - In my script steps_per_epoch. This might be confusing!
  • adaptive_alpha: Whether we also want to train alpha or keep it fixed - In my implementation target_entropy. This variable can take three kinds of values. If you supply "auto", the algorithm automatically determines the required alpha_target based on the action space size. If you supply a float, the alpha_target is set equal to this float. If you supply None, alpha is not trained.
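The mapping above can be summarized as a small translation helper. This is only a sketch: the function names are hypothetical, the example values are made up, and the "auto" rule of using minus the action dimension is an assumption based on the common SAC heuristic, not something confirmed by either codebase.

```python
def translate_han_hyperparameters(han):
    """Map Han-style hyperparameter names onto the names used in my implementation."""
    steps_per_epoch = han["evaluation_frequency"]
    return {
        "steps_per_epoch": steps_per_epoch,
        # max_global_steps corresponds to steps_per_epoch * epochs.
        "epochs": han["max_global_steps"] // steps_per_epoch,
        "num_test_episodes": han["num_of_evaluation_paths"],
        "update_every": han["steps_per_cycle"],
        # adaptive_alpha=False means alpha stays fixed, which maps to None.
        "target_entropy": "auto" if han["adaptive_alpha"] else None,
    }


def resolve_target_entropy(target_entropy, action_dim):
    """Return (train_alpha, entropy_target) for the three supported settings."""
    if target_entropy is None:
        return False, None
    if target_entropy == "auto":
        # Assumption: the common SAC heuristic of minus the action dimension.
        return True, -float(action_dim)
    return True, float(target_entropy)


# Made-up example values.
han_config = {
    "max_global_steps": 100000,
    "evaluation_frequency": 2000,
    "num_of_evaluation_paths": 10,
    "steps_per_cycle": 100,
    "adaptive_alpha": True,
}
mine = translate_han_hyperparameters(han_config)
print(mine)
print(resolve_target_entropy(mine["target_entropy"], action_dim=2))
```

The options num_of_trials, start_of_trail, and num_of_training_paths are left out on purpose since they have no counterpart in my implementation.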

Differences that still exist between the two codes

  • Minghao throws away the first 1000 steps, which are used to fill the memory buffer (min_memory_size=1000), and starts counting steps after that. This will not have an effect other than reducing the max_step by 1000.
  • Minghao also has the option to use a finite time
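The step counting from the first bullet, combined with the steps_per_cycle and train_per_cycle schedule, can be sketched as follows. This is an illustration under my reading of the behaviour described above, not a copy of either training loop; the function name `count_updates` and the example numbers are hypothetical.

```python
def count_updates(total_env_steps, min_memory_size, steps_per_cycle, train_per_cycle):
    """Count the env steps that are tallied and the gradient steps that are taken.

    The first min_memory_size env steps only fill the replay buffer and are
    not counted. After that, every steps_per_cycle counted steps trigger
    train_per_cycle gradient steps.
    """
    counted_steps = 0
    gradient_steps = 0
    for env_step in range(1, total_env_steps + 1):
        if env_step <= min_memory_size:
            continue  # warm-up step: fills the buffer only
        counted_steps += 1
        if counted_steps % steps_per_cycle == 0:
            gradient_steps += train_per_cycle
    return counted_steps, gradient_steps


# Made-up settings: 3000 env steps in total, 1000 of which are warm-up.
han_style = count_updates(3000, 1000, steps_per_cycle=100, train_per_cycle=80)
locked_ratio = count_updates(3000, 1000, steps_per_cycle=100, train_per_cycle=100)
print(han_style, locked_ratio)
```

With train_per_cycle equal to steps_per_cycle the ratio of counted env steps to gradient steps is 1, which matches the locked ratio in my implementation. Any other value changes the number of gradient steps per env step.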

Differences that still exist between the (translated) PyTorch LAC and TensorFlow LAC

The SquashedGaussian actor is too complex for a one-to-one translation. I therefore had to use the nn.Module class instead of the nn.Sequential class. While doing this, however, I found a small difference between the LAC code and the SAC (Spinning Up) class.

LAC returns a squashed deterministic action during inference.

In both Minghao's code and the code of Haarnoja et al. 2019 ([see L125](https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/policies/gaussian_policy.py#L125)), the (deterministic) clipped_mu which comes from the mu.network() is squashed with the Tanh function. In the Spinning Up version, this is not done.

SAC version (L49)

mu = self.mu_layer(net_out)
clipped_mu = mu

LAC version (L244)

mu = tf.layers.dense(net_1, self.a_dim, activation= None, name='a', trainable=trainable)
clipped_mu = squash_bijector.forward(mu)

This issue was fixed and will be shipped with the next release. See #18 for the release report.

I double-checked this again, and Spinning Up does squash mu (see https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/sac/core.py#L64).
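For reference, the behaviour that both implementations should end up with can be sketched in plain Python. The function `select_action` and its arguments are hypothetical and only mirror the idea that the Tanh squashing is applied to the action regardless of whether it is the deterministic mean or a sample.

```python
import math
import random


def select_action(mu, log_std, deterministic, rng=random):
    """Return a Tanh-squashed action from a diagonal Gaussian policy output.

    In deterministic mode (inference) the mean is used directly, otherwise a
    sample is drawn. The squashing is applied in both cases.
    """
    if deterministic:
        pre_squash = list(mu)
    else:
        pre_squash = [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_std)]
    return [math.tanh(u) for u in pre_squash]


# Made-up policy output for a two-dimensional action space.
mu = [0.0, 2.0]
log_std = [-1.0, -1.0]

greedy_action = select_action(mu, log_std, deterministic=True)
sampled_action = select_action(mu, log_std, deterministic=False)
print(greedy_action, sampled_action)
```

If the deterministic branch skipped the Tanh, the second component would be 2.0 instead of roughly 0.96, which is the kind of mismatch that would show up as an offset during inference.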