Implement LPG
User story
As discussed in the meeting, we want to implement the LPG agent. @panweihit @dds0117 In this report, I will track the progress of this new algorithm.
Steps
- 1. Check if the regular SAC and LAC algorithms train on the CartPole environments.
- 2. Remove the Lyapunov constraint and check if the agent can train on the CartPoleCost environment (lac2).
  - If not training, check if the SAC algorithm successfully trains on the environment.
  - If this is the case, check if the SAC algorithm without the double Q-trick trains successfully in the environment (sac2).
  - If needed, add a second Lyapunov Critic network, use the double Q-trick, and train it on the environment (lac3).
  - Let's add the entropy regularization term of SAC to the Critic loss function, as this is more similar to what SAC does (lac5).
  - Let's move the Lyapunov optimization before the actor loss update (lac6).
- 3. Move the Lyapunov constraint from the actor loss to the critic loss function (lac4); see the sketch after this list.
- 4. Replace the Q values in the critic loss with the actual value function.
- 5. Approximate the value function with a Gaussian Process.
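For reference, below is a minimal sketch of what steps 2–3 mean in code: Han et al.'s Lyapunov constraint (their eq. 11) and how it can be attached either to the actor loss (original LAC) or to the critic loss (lac4). All names (`l1`, `lya_l_`, `labda`, `alpha3`, and the loss functions) are illustrative assumptions and do not necessarily match the repository's implementation.

```python
import torch

# Illustrative tensors for one minibatch (all names are assumptions, not the repo's code):
#   l1      - Lyapunov critic value L(s, a) for the sampled state-action pairs
#   lya_l_  - Lyapunov critic value L(s', a') for the next state and the policy's action
#   r       - cost signal (Han et al. work with costs rather than rewards)
#   log_pis - log-probabilities of the policy's actions
#   labda   - Lagrange multiplier, alpha - entropy temperature, alpha3 - constraint weight


def lyapunov_constraint(lya_l_, l1, r, alpha3):
    """Han et al. eq. 11: the constraint L(s', a') - L(s, a) + alpha3 * c <= 0."""
    return torch.mean(lya_l_ - l1.detach() + alpha3 * r)


def actor_loss_lac(labda, alpha, log_pis, l_delta):
    """LAC-style actor loss: the Lyapunov constraint enters via the Lagrange multiplier."""
    return labda.detach() * l_delta + alpha.detach() * torch.mean(log_pis)


def critic_loss_lac4(critic_mse, labda, l_delta):
    """LAC4-style critic loss (step 3): the constraint is added to the critic loss instead."""
    return critic_mse + labda.detach() * l_delta
```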
LAC versions legend
- LAC: Regular LAC.
- LAC2: LAC without any Lyapunov constraint (similar to SAC but with a squared output activation).
- LAC3: LAC but now with the double Q-trick added.
- LAC4: LAC but now the Lyapunov constraint is added to the critic loss instead of the actor loss.
- LAC5: LAC but now we also add the entropy regularization term to the critic (more theoretically correct).
- LAC6: LAC but now the Lagrange multipliers are optimized before they are used to optimize the critic and actor.
- LAC7: LAC but now we use the minimum Lyapunov target in the Lyapunov constraint.
- LAC8: LAC but now we replace the strict asymptotic stability in mean cost with general asymptotic stability.
- SAC: Regular SAC.
- SAC2: SAC but without the double Q-trick.
- SAC3: SAC but now it uses v1 of Haarnoja et al. 2019.
- SAC4: SAC but now it uses v2 of Haarnoja et al. 2019.
Test training Performance (CartPoleCost)
Let's first test the training performance of the following LAC versions in the CartPoleCost environment:
- LAC: The regular LAC algorithm as implemented by Han et al.
- LAC2: Version in which the Lyapunov constraint has been removed, but the Lyapunov critic is kept. It is similar to SAC with a SINGLE Lyapunov critic.
- LAC3: Version without the Lyapunov constraint but with the double Q-trick.
- LAC4: Version with the Lyapunov constraint, but now added to the critic loss.
- LAC5: LAC but with the entropy term also added to the critic loss.
- LAC6: Like LAC, but the Lagrange multiplier is optimized before it is used (I think, in theory, it makes more sense).
Let's also quickly investigate the following SAC versions:
- SAC: Regular SAC as implemented by Haarnoja et al.
- SAC2: Similar to SAC, but now the double Q-trick has been removed.
Regular SAC and LAC performance
LAC
Experiment file: experiments/gpl_2021/lac_cart_pole_cost.yml
As we already know, LAC works.
Open the report
SAC
Experiment file: experiments/gpl_2021/sac_cart_pole_cost.yml
As we already know, SAC also performs well on the CartPoleCost environment.
Open the report
SAC2
Experiment file: experiments/gpl_2021/sac2_cart_pole_cost.yml
Seems to work fine.
Open the report
LAC2
Experiment file: experiments/gpl_2021/lac2_cart_pole_cost.yml
Also works.
Open the report
LAC3
Experiment file: experiments/gpl_2021/lac3_cart_pole_cost.yml
Also works.
Open the report
LAC4
Experiment file: experiments/gpl_2021/lac4_cart_pole_cost.yml
Also works, but after this first test, it looks like the performance is worse. This could also be due to random factors.
Open the report
LAC5
Experiment file: experiments/gpl_2021/lac5_cart_pole_cost.yml
Works as expected.
Open the report
LAC6
Experiment file: experiments/gpl_2021/lac6_cart_pole_cost.yml
Works as expected.
Open the report
Conclusion
All algorithms are able to train. For simplicity, let's first work with LAC4, as we can make the other changes later. For this algorithm, we should compare the robustness against disturbances with that of the original LAC algorithm.
Disturbance robustness evaluation (CartPoleCost)
LAC original results
Seems to work fine.
LAC4 results
Seems to give the same results as the original LAC.
SAC original results
As in Han et al. 2020, the robustness is lower than that of the LAC algorithm. Related to that, the algorithm also has a higher death rate.
Disturbance robustness evaluation (Oscillator)
LAC original results
LAC4 results
Seems to give the same results as the original LAC.
SAC original results
Meeting notes 17-04-2021
- We were able to improve the LAC robustness by only using the Lyapunov Value that came from the best action given the current policy.
- We found out that the `alpha3*R` term can be dropped and a simple `alpha3` term can be used. This results in a softer version of Lyapunov stability (the derivative is less negative), but this version can be used to make any cost function stable in the sense of Lyapunov (more practical).
- We further found the following problems in which we might possibly test the new LAC algorithm in the future:
  - Mark-time Humanoid: Like a soldier marching in place. Can also include upper body movements or frequency requirements.
  - Cheetah: Hopping in place + frequencies.
  - Bicycle: Maybe in the future we can use this or this environment.
  - Drone: We leave it for now, but maybe later we can use Flightmare.
  - Car tracking: We can do the steering manoeuvre test with this simulator.
  - Cubli walking: Gyroscopic cube https://www.youtube.com/watch?v=n_6p-1J551Y.
  - Full cheetah: Like Boston Dynamics.
Evaluate LAC robustness
@panweihit Let's evaluate the new LAC4 and compare it with SAC for multiple environments, but now let it train for `1e6` steps:
- CartPoleCost
- Oscillator-v1
Oscillator-v1
LAC
Good performance; it looks better than SAC but worse than LAC4.
LAC4
Performance and robustness look better than LAC (could still be seeding). It also looks better than SAC.
SAC
Performance and robustness look worse than both LAC versions.
Conclusion
- The new LAC4 algorithm, in which the Lyapunov constraint is placed in the critic cost function and only the minimum Lyapunov value is used in the Lyapunov constraint, works.
- It looks like both the performance and robustness are improved compared to LAC. More seeds are needed to be sure.
- For performance (and robustness), training an agent for `3e5` steps looks to be enough in the Oscillator-v1 environment.
- LAC performance is similar to SAC, but it has higher robustness.
Meeting notes (18-04-2021)
@dds0117 I had a meeting with @panweihit yesterday to discuss the results of the tests above and the continuation of our research. Below you will find the notes of the meeting.
Results discussion
- The new LAC algorithm seems to work as well as (maybe even better than) the old LAC and SAC algorithms. There are several things we can still investigate:
  - We changed the Lyapunov constraint from

    ```python
    l_delta = torch.mean(lya_l_ - l1.detach() + self._alpha3 * r)  # See Han eq. 11
    ```

    to

    ```python
    l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r)  # See Han eq. 11
    ```

    In this new version, only the minimum Lyapunov value of the current best action is used to check if the constraint is violated (a sketch of how `l_delta` feeds the Lagrange-multiplier update follows after this list).
  - We also replaced the `alpha_3 * r` term in the equation with a small `alpha_3` term (0.0001). This was done to check if this term is vital. From the results above, we can see that the algorithm achieves the same performance even without it. The `alpha_3*r` term makes sure that the algorithm is stable in mean cost; it incorporates extra information about the system, which we try to exploit by making our Lyapunov stability definition stricter. As the algorithm is also robust without this information, dropping it increases the practical relevance of our algorithm, since such information might not be available for all systems, or the problem might be too hard when using this stricter Lyapunov stability. As researchers, we can use any of the Lyapunov stability criteria, (strict) asymptotic stability, exponential stability, (strict) asymptotic stability in mean cost, etc., for our algorithm.
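For context, here is a hedged sketch of how a constraint value like `l_delta` typically feeds the Lagrange-multiplier update in such a Lagrangian-relaxation setup (this also relates to the LAC6 idea of updating the multiplier before the actor/critic updates). The variable names (`log_labda`) and the learning rate are assumptions and may differ from the actual code.

```python
import torch

# Minimal sketch: the multiplier is kept positive by optimizing its log.
log_labda = torch.zeros(1, requires_grad=True)
labda_optimizer = torch.optim.Adam([log_labda], lr=3e-4)


def update_lagrange_multiplier(l_delta):
    """Increase labda when the Lyapunov constraint (l_delta <= 0) is violated,
    and decrease it when the constraint is satisfied."""
    labda_loss = -(log_labda.exp() * l_delta.detach()).mean()
    labda_optimizer.zero_grad()
    labda_loss.backward()
    labda_optimizer.step()
    return log_labda.exp().detach()
```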
Other discussion points
@panweihit pointed me to a very insightful MIT course given by Dr. Russ Tedrake. This course explains that, as long as your reward is Lyapunov stable (has a decreasing derivative), the system also learns stable and robust behaviour. I haven't watched the full lecture yet, so I will update the explanation below later. But here is my current understanding:
This conclusion implies that we don't need to design very complicated stability measures for our robot tasks. A reward that makes sure that the robot doesn't fall is good enough to ensure stability and robustness. Let's take Boston Dynamics' Spot as an example. In this case, we don't need a cost function that exploits complicated theoretical stability measures, like the Zero-Moment Point or the COM being vertically inside the convex hull of its contact points, to achieve stable behaviour. According to Dr. Russ Tedrake, using such knowledge is merely a bonus. A simpler cost function, like the perpendicular distance between the robot COM and the reference path, already implicitly encodes the stability. If the robot cannot track this path, it has fallen, so it is learning stable behaviour when our Lyapunov values are always decreasing. This greatly increases how practical our algorithm is, since we can now use it to learn stable/robust behaviour even when theoretical knowledge about the system's stability is not available. For systems where we have such knowledge, we can use it to get an additional bonus.
What do we need to do now
Currently, I'm finishing several experiments to:
- Solve why the CartPole cart is not converging to zero.
- Check whether the changes we made to the LAC algorithm, discussed above, really achieve better stability/robustness.
- Run 3 random seeds to see if LAC4 is better than LAC.
- Check the required number of steps to train an agent in the Oscillator and CartPole environments.
  - I train the agents for `1e6` steps and look at where the agent's performance and robustness stagnate.
- Check if increasing the episode length of training to 800 instead of 400 improves the LAC performance and robustness in the Oscillator environment.
- Check whether changing the gamma improves the performance and robustness.
I am further adding a value network to the LAC algorithm so that we can replace it with a Gaussian process. Replacing it with a Gaussian process makes sense since this allows some stochasticity in the value function, making it easier for the agent to learn stable behaviour. The reasoning is similar in nature to why SAC uses a Gaussian actor instead of a deterministic one; here, we use a stochastic value function instead of a deterministic one. We use a Gaussian process instead of a Gaussian network since the value function is convex in nature. @panweihit and I agreed that, because of this nature, a Gaussian process would be well able to capture the behaviour while keeping the algorithm simple. Your Gaussian process will replace the value network of the new LAC algorithm (I will create this algorithm based on the second version of SAC).
The next steps for creating the GPL algorithm, therefore, are as follows:
- Add value network to LAC.
- Replace it with a Gaussian process (see the sketch below this list).
- Perform experiments.
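To make the "replace it with a Gaussian process" step more concrete, below is a minimal sketch of a GP value-function approximator fitted on Monte-Carlo returns. It assumes scikit-learn's `GaussianProcessRegressor`; the class name `GPValueFunction` and its interface are hypothetical and not part of the repository. Whether the GP should be fit on Monte-Carlo returns or bootstrapped targets is exactly the open question discussed below.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


class GPValueFunction:
    """Hypothetical GP value function: maps states to a value estimate
    with predictive uncertainty (the stochasticity mentioned above)."""

    def __init__(self):
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
        self.gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

    def fit(self, states, returns):
        # states: (N, obs_dim) array, returns: (N,) Monte-Carlo returns.
        self.gp.fit(states, returns)

    def predict(self, states):
        # Returns the value estimate and its standard deviation per state.
        mean, std = self.gp.predict(states, return_std=True)
        return mean, std


# Usage sketch: fit on a batch of visited states and their observed returns.
# states = np.random.randn(256, 4); returns = np.random.randn(256)
# v = GPValueFunction(); v.fit(states, returns); mean, std = v.predict(states[:10])
```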
LAC4 Improvements
Take min Lyapunov target value
@panweihit slightly modified the Lyapunov constraint such that the minimum Lyapunov value is now used in the Lyapunov constraint:
```python
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r)  # See Han eq. 11
```
Remove stricter Lyapunov stability
We removed the `alpha_3*r` term from the Lyapunov constraint:
```python
self._alpha3 = 0.000001  # Small quadratic regulator term to ensure negative definiteness. Without it, the derivative can be negative semi-definite.
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3)  # See Han eq. 11
```
The LAC algorithm trains fine without this.
Yes, I agree with you. The Gaussian process value function is finished, but I ran into a problem with using the GP value function directly in place of the value network. Because the Gaussian process depends on the temporal sequence seen during training, it would have to use a Monte-Carlo update instead of a temporal-difference (TD) update. I am stuck on this, so let's talk about it tomorrow.
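To illustrate the distinction mentioned above, here is a small, hedged sketch of the two target types (plain Python, illustrative names only): a TD target bootstraps from the current value estimate after every step, while a Monte-Carlo target needs the full remaining trajectory, which is why a GP fitted on whole episodes would push us towards Monte-Carlo updates.

```python
def td_target(reward, next_value, gamma=0.99, done=False):
    """One-step temporal-difference target: bootstraps from the current value
    estimate, so it can be computed online after every transition."""
    return reward + gamma * (0.0 if done else next_value)


def monte_carlo_returns(rewards, gamma=0.99):
    """Full Monte-Carlo returns: the whole episode must be finished before the
    update, which is what fitting a GP on complete trajectories would require."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))


# Example: an episode with rewards [1, 0, 2] gives returns
# [1 + 0.99 * 0 + 0.99**2 * 2, 0 + 0.99 * 2, 2].
# print(monte_carlo_returns([1.0, 0.0, 2.0]))
```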
@panweihit, @dds0117 Here is the new model that was trained for the robustness eval of the cart_pole.
Robustness eval Instructions
See also https://rickstaa.github.io/bayesian-learning-control/control/eval_robustness.html.
- Create a conda environment.
- Activate the conda environment.
- Install the packages: `pip install -e .`
- Put the model inside the data folder.
- Run the following command:

  ```bash
  python -m bayesian_learning_control.run eval_robustness ~/Development/work/bayesian-learning-control/data/lac4_cart_pole_cost/lac4_cart_pole_cost_s1250 --disturbance_type=input
  ```
- See the results
Change the disturbance
To change the disturbance, change the magnitude inside the `DISTURBER_CFG` variable in the https://github.com/rickstaa/simzoo/blob/c0f32230f68b7f0353412a848d8b8598cd82d21c/simzoo/common/disturber.py#L61 file.
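As a purely hypothetical illustration of the kind of edit meant here (the real structure of `DISTURBER_CFG` is defined in the linked `disturber.py` and may differ), increasing an input-disturbance magnitude could look roughly like this:

```python
# Hypothetical excerpt; check the linked disturber.py for the actual keys and structure.
DISTURBER_CFG = {
    "input_disturbance": {
        "magnitude": 5.0,  # increase or decrease to change the disturbance strength
    },
}
```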
Discussion 11-06-2021
@panweihit, @dds0117 For future reference, here is a small summary of what we found out in our experimentation yesterday:
- The performance and robustness of LAC and LAC4 look similar. The performance of SAC is similar, but its robustness is lower.
- The robustness is very much dependent on the actor and critic network structure.
  - When we used a linear (affine) network structure (i.e. [1] or [16]) for the actor, the agent was not able to find any rewarding behaviour.
As we discussed, I think the main takeaway is that, when we implement the Gaussian version of the LAC algorithm, it should work as long as the function approximator, a (deep) Gaussian process, is expressive enough to capture the complexity of the system.
Closed since there are more important things to do first.