Validate LAC/SAC pytorch translation
Closed this issue · 8 comments
User story
To be able to ship the LAC/SAC PyTorch implementation to the team, we need to validate whether it gives the same results as the LAC/SAC TensorFlow version.
Considerations
Validate SAC (LAC use_lyapunov=False)
- Run both the Spinning Up SAC and LAC (use_lyapunov=False) to see if they give the same results.
- In the meantime quickly compare both codes to see if they are fully compatible.
Validate LAC
- Run LAC of Han 2019 and the new LAC to see if they give the same results.
- In the meantime quickly compare both codes to see if they are fully compatible.
Acceptance criteria
- Both codes give the same results when started with similar parameters.
Validate LAC
It appears that the LAC PyTorch implementation has a higher offset than the SAC and LAC TensorFlow implementations:
The performance becomes worse after more training steps:
Differences between the PyTorch implementation and the one of Minghao
- Minghao uses log_alpha in the alpha_loss formula (see L116). I use alpha since this is in line with how Haarnoja 2019 (see L254) performs automatic temperature tuning. I don't think this should matter much since they are both increasing functions at x-> but maybe there is a good reason Minghao uses log_alpha.
- In the loss_lambda formula, Minghao also uses log_lambda where I would expect him to use lambda (see L115).
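To make the alpha versus log_alpha question concrete, here is a framework-free sketch that compares the gradient of both loss variants with respect to log_alpha. It assumes the usual SAC temperature loss form and that alpha is parameterised as exp(log_alpha) in both cases. The function names, the sample log-probabilities, and the target entropy are illustrative and are not taken from either code base. Under these assumptions the two gradients have the same sign and differ only by the positive factor alpha, so both variants push the temperature in the same direction but with a different effective step size.

```python
import math


def alpha_loss_log(log_alpha, log_pis, target_entropy):
    """Variant that multiplies the entropy gap by log_alpha."""
    return -sum(log_alpha * (lp + target_entropy) for lp in log_pis) / len(log_pis)


def alpha_loss_plain(log_alpha, log_pis, target_entropy):
    """Variant that multiplies the entropy gap by alpha = exp(log_alpha)."""
    alpha = math.exp(log_alpha)
    return -sum(alpha * (lp + target_entropy) for lp in log_pis) / len(log_pis)


def grad_wrt_log_alpha(loss_fn, log_alpha, log_pis, target_entropy, eps=1e-6):
    """Central finite-difference gradient of loss_fn with respect to log_alpha."""
    upper = loss_fn(log_alpha + eps, log_pis, target_entropy)
    lower = loss_fn(log_alpha - eps, log_pis, target_entropy)
    return (upper - lower) / (2 * eps)


# Hypothetical batch of action log-probabilities and target entropy.
log_pis = [-1.2, -0.7, -2.0]
target_entropy = -1.0
log_alpha = 0.5

g_log = grad_wrt_log_alpha(alpha_loss_log, log_alpha, log_pis, target_entropy)
g_plain = grad_wrt_log_alpha(alpha_loss_plain, log_alpha, log_pis, target_entropy)

# The ratio of the two gradients is approximately alpha = exp(log_alpha).
print(g_log, g_plain, g_plain / g_log)
```

The same reasoning would apply to log_lambda versus lambda in loss_lambda, provided that loss has the same multiplicative structure.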
Since we have not yet found what causes the difference between the two implementations, I will do the following:
- Modify the logger such that it displays the variables in the same way as Han et al. 2019.
- Double-check the used hyperparameters.
- Compare the PyTorch code with the article again to see if there is something that is unclear.
- Compare the PyTorch code with the TensorFlow code again to see if I missed something.
- Double-check the action clamping.
- Double-check whether `use_lyapunov` disables the right parts.
- Add Tensorboard logging to TensorFlow version so we can visually compare the two implementations.
- Print TensorFlow network graph to Tensorboard so we can compare it with the graph in the PyTorch implementation.
After that I can also:
- Check if there is a problem with the inference script. This can be done by adding robustnes_eval.py to the PyTorch implementation.
- Set the random seeds and initial conditions equal in both scripts, and use eager execution to compare the exact outputs of all steps between the PyTorch and TensorFlow implementations.
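The last step above could be sketched roughly as follows. The helper names `seed_everything` and `first_mismatch` are hypothetical and do not exist in either code base. The sketch seeds Python's RNG and, only if they happen to be installed, NumPy, PyTorch and TensorFlow, and then reports the first step at which two logged sequences diverge. Note that equal seeds alone will not make PyTorch and TensorFlow draw the same random numbers, so the initial network weights and sampled noise would still need to be copied from one script to the other.

```python
import math
import random


def seed_everything(seed):
    """Seed Python's RNG and, if installed, NumPy, PyTorch and TensorFlow."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        # Assumes a TensorFlow version that ships the tf.compat.v1 namespace.
        tf.compat.v1.set_random_seed(seed)
    except ImportError:
        pass


def first_mismatch(values_a, values_b, rel_tol=1e-6, abs_tol=1e-8):
    """Return the index of the first differing step, or None if all steps match."""
    if len(values_a) != len(values_b):
        raise ValueError("Both runs must log the same number of steps.")
    for i, (a, b) in enumerate(zip(values_a, values_b)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    return None
```

Feeding per-step losses or Q-values from both scripts into `first_mismatch` would point at the first update where the two implementations start to differ.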
Double-check hyperparameters
Hyperparameter translation Han vs Mine
- `max_global_steps`: `steps_per_epoch` * `epochs`.
- `num_of_trials`: The number of random seeds (agents) you train. Does not exist in my implementation; I only train 1 agent for `steps_per_epoch` * `epochs` steps.
- `start_of_trail`: The start index of the folder in which the random seeds are saved. Example: `start_of_trail=4` gives `./LAC20200827_0046/4/`, `./LAC20200827_0046/5/`, etc. Does not exist in my implementation.
- `num_of_evaluation_paths`: The number of rollouts/trajectories that are used in the test run. In my implementation: `num_test_episodes`.
- `num_of_training_paths`: Does not exist in the PyTorch implementation. In addition to the ReplayBuffer, Han also stores the trajectories for the rollouts. It is these trajectories that are used during the performance evaluation (The values that are printed to the)
- `steps_per_cycle`: The number of steps taken before performing the SGD update. In my implementation: `update_every`.
- `train_per_cycle`: The number of SGD passes to perform with every SGD cycle. Doesn't exist in my implementation. I simply locked the ratio of env steps to gradient steps to 1, meaning that after `update_every` steps the SGD will run `update_every` times. In Minghao's code this means that after `steps_per_cycle` steps the SGD will be performed `train_per_cycle` times. Added this to my implementation to see if it has any effect.
- ⚠️ `evaluation_frequency`: Means end-of-epoch handling (save model, test performance and log data). In my script: `steps_per_epoch`. !!! Might be confusing !!!
- `adaptive_alpha`: Whether we want to also train the alpha or keep it fixed. In my implementation: `target_entropy`. This variable can have 3 values. If you supply it with "auto", the algorithm will automatically determine the required `alpha_target` based on the action space size. If you supply a float, the `alpha_target` will be set equal to this float. If you supply it with None, the alpha will not be trained.
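Two of the translations above are easy to get wrong, so here is a small sketch of both. The function names are hypothetical. The first function counts gradient steps under the `steps_per_cycle` / `train_per_cycle` schedule, which reduces to the locked 1:1 ratio when both are equal to `update_every`. The second resolves the three accepted `target_entropy` values; using the negative action dimension for "auto" is an assumption based on the common SAC heuristic, since the text above only says it depends on the action space size.

```python
def count_gradient_steps(total_env_steps, steps_per_cycle, train_per_cycle):
    """Gradient steps taken when train_per_cycle updates run every steps_per_cycle env steps."""
    completed_cycles = total_env_steps // steps_per_cycle
    return completed_cycles * train_per_cycle


def resolve_target_entropy(target_entropy, act_dim):
    """Map the target_entropy setting to a numeric target, or None for a fixed alpha."""
    if target_entropy is None:
        return None  # alpha is kept fixed and not trained
    if target_entropy == "auto":
        return -float(act_dim)  # assumption: common SAC heuristic
    return float(target_entropy)


# Locked 1:1 ratio (update_every=50) versus a hypothetical 100/80 schedule.
print(count_gradient_steps(1000, 50, 50))
print(count_gradient_steps(1000, 100, 80))
```

With the same number of environment steps, a schedule where `train_per_cycle` is smaller than `steps_per_cycle` performs fewer gradient updates, which is one way the two implementations could diverge even with otherwise identical hyperparameters.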
Differences that still exist between the two codes
- Minghao throws away the first 1000 steps, which are used to fill the memory buffer (min_memory_size=1000), and starts counting steps after that. This will not have an effect other than reducing the max_step by 1000.
- Minghao also has the option to use a finite time
Differences that still exist between the (translated) PyTorch LAC and TensorFlow LAC
The SquashedGaussian actor is too complex for a one-to-one translation. I therefore had to use the nn.Module class instead of the nn.Sequential class. When doing this, however, I found a small difference between the LAC code and the SAC (Spinning Up) class.
LAC returns a squashed deterministic action during inference.
In both Minghao's code and the code of Haarnoja et al. 2019 ([see L125](https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/policies/gaussian_policy.py#L125)) the (deterministic) `clipped_mu`, which comes from the `mu.network()`, is squashed with the Tanh function. In the Spinning Up version, this is not done.
PyTorch (mu is not squashed):

```python
mu = self.mu_layer(net_out)
clipped_mu = mu
```

TensorFlow (mu is squashed):

```python
mu = tf.layers.dense(net_1, self.a_dim, activation=None, name='a', trainable=trainable)
clipped_mu = squash_bijector.forward(mu)
```
- Maybe something to look at in later debugging: the torch.clamp gradient problem.
This issue was fixed and will be shipped with the next release. See #18 for the release report.
I double-checked the squashing of mu again, and Spinning Up does squash it (see https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/sac/core.py#L64).
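The behaviour that both implementations should share can be sketched without any framework as below. This is a minimal illustration and not code from either repository: `select_action` is a hypothetical helper, and the policy head is assumed to be a diagonal Gaussian with outputs `mu` and `log_std`. In both the deterministic and the stochastic branch, the pre-squash value passes through tanh, so the deterministic action is tanh(mu) rather than mu itself.

```python
import math
import random


def select_action(mu, log_std, deterministic=False, rng=random):
    """Return a tanh-squashed action from a diagonal Gaussian policy head."""
    if deterministic:
        # Inference: use the mean directly as the pre-squash action.
        pre_squash = list(mu)
    else:
        # Training: sample around the mean with std = exp(log_std).
        pre_squash = [
            m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)
        ]
    # The squashing is applied in both branches.
    return [math.tanh(u) for u in pre_squash]


mu = [0.0, 2.0, -5.0]
log_std = [-1.0, -1.0, -1.0]
print(select_action(mu, log_std, deterministic=True))
```

Comparing the deterministic output of each implementation against tanh(mu) for the same `mu` would be a quick check that the squashing is applied consistently during inference.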