Implement LPG
User story
As discussed in the meeting, we want to implement the LPG agent. @panweihit @dds0117 In this report, I will track the progress of this new algorithm.
Steps
- 1. Check if the regular SAC and LAC algorithms train on the CartPole environments.
- 2. Remove the Lyapunov constraint and check if the agent can train on the CartPoleCost environment (lac2).
  - If not training, check if the SAC algorithm successfully trains on the environment.
  - If this is the case, check if the SAC algorithm without the double Q-trick trains successfully in the environment (sac2).
  - If needed, add a second Lyapunov Critic network, use the double Q-trick, and train it on the environment (lac3).
  - Let's add the entropy regularization term of SAC to the Critic loss function, as this is more similar to what SAC does (lac5).
  - Let's move the Lyapunov optimization before the actor loss update (lac6).
- 3. Move the Lyapunov constraint from the actor loss to the critic loss function (lac4); see the sketch after this list.
- 4. Replace the Q values in the critic loss with the actual value function.
- 5. Approximate the value function with a Gaussian Process.
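For reference, below is a minimal sketch of what steps 2–3 mean in code: Han et al.'s Lyapunov constraint (their eq. 11) and how it can be attached either to the actor loss (original LAC) or to the critic loss (lac4). All names (`l1`, `lya_l_`, `labda`, `alpha3`, and the loss functions) are illustrative assumptions and do not necessarily match the repository's implementation.

```python
import torch

# Illustrative tensors for one minibatch (all names are assumptions, not the repo's code):
#   l1      - Lyapunov critic value L(s, a) for the sampled state-action pairs
#   lya_l_  - Lyapunov critic value L(s', a') for the next state and the policy's action
#   r       - cost signal (Han et al. work with costs rather than rewards)
#   log_pis - log-probabilities of the policy's actions
#   labda   - Lagrange multiplier, alpha - entropy temperature, alpha3 - constraint weight


def lyapunov_constraint(lya_l_, l1, r, alpha3):
    """Han et al. eq. 11: the constraint L(s', a') - L(s, a) + alpha3 * c <= 0."""
    return torch.mean(lya_l_ - l1.detach() + alpha3 * r)


def actor_loss_lac(labda, alpha, log_pis, l_delta):
    """LAC-style actor loss: the Lyapunov constraint enters via the Lagrange multiplier."""
    return labda.detach() * l_delta + alpha.detach() * torch.mean(log_pis)


def critic_loss_lac4(critic_mse, labda, l_delta):
    """LAC4-style critic loss (step 3): the constraint is added to the critic loss instead."""
    return critic_mse + labda.detach() * l_delta
```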
LAC versions legend
- LAC: Regular LAC.
- LAC2: LAC without any Lyapunov constraint (similar to SAC but with a squared output activation).
- LAC3: LAC but now with the double Q-trick added.
- LAC4: LAC but now the Lyapunov constraint is added to the critic loss instead of the actor loss.
- LAC5: LAC but now we also add the entropy regularization term to the critic (more theoretically correct).
- LAC6: LAC but now the Lagrange multipliers are optimized before they are used to optimize the critic and actor.
- LAC7: LAC but now we use the minimum Lyapunov target in the Lyapunov constraint.
- LAC8: LAC but now we replace the strict asymptotic stability in mean cost with general asymptotic stability.
- SAC: Regular SAC.
- SAC2: SAC but without the double Q-trick.
- SAC3: SAC but now it uses v1 of Haarnoja et al. 2019.
- SAC4: SAC but now it uses v2 of Haarnoja et al. 2019.
Test training Performance (CartPoleCost)
Let's first test the training performance of the following LAC versions in the CartPoleCost environment:
- LAC: The regular LAC algorithm as implemented by Han et al.
- LAC2: Version in which the Lyapunov constraint has been removed, but the Lyapunov critic is kept. It is similar to SAC with a SINGLE Lyapunov critic.
- LAC3: Version without the Lyapunov constraint but with the double Q-trick.
- LAC4: Version with the Lyapunov constraint, but now added to the critic loss.
- LAC5: LAC but with the entropy term also added to the critic loss.
- LAC6: Like LAC, but the Lagrange multiplier is optimized before it is used (I think, in theory, it makes more sense).
Let's also quickly investigate the following SAC versions:
- SAC: Regular SAC as implemented by Haarnoja et al.
- SAC2: Similar to SAC, but now the double Q-trick has been removed.
Regular SAC and LAC performance
LAC
Experiment file: experiments/gpl_2021/lac_cart_pole_cost.yml
As we already know, LAC works.
Open the report
SAC
Experiment file: experiments/gpl_2021/sac_cart_pole_cost.yml
As we already know, SAC also performs well on the CartPoleCost environment.
Open the report
SAC2
Experiment file: experiments/gpl_2021/sac2_cart_pole_cost.yml
Seems to work fine.
Open the report
LAC2
Experiment file: experiments/gpl_2021/lac2_cart_pole_cost.yml
Also works.
Open the report
LAC3
Experiment file: experiments/gpl_2021/lac3_cart_pole_cost.yml
Also works.
Open the report
LAC4
Experiment file: experiments/gpl_2021/lac4_cart_pole_cost.yml
Also works, but after this first test, it looks like the performance is worse. This could also be due to random factors.
Open the report
LAC5
Experiment file: experiments/gpl_2021/lac5_cart_pole_cost.yml
Works as expected.
Open the report
LAC6
Experiment file: experiments/gpl_2021/lac6_cart_pole_cost.yml
Works as expected.
Open the report
Conclusion
All algorithms are able to train. For simplicity, let's first work with LAC4, as we can make the other changes later. For this algorithm, we should compare the robustness against disturbances with that of the original LAC algorithm.
Disturbance robustness evaluation (CartPoleCost)
LAC original results
Seems to work fine.
LAC4 results
Seems to give the same results as the original LAC.
SAC original results
As in Han et al. 2020, the robustness is lower than that of the LAC algorithm. Related to that, the algorithm also has a higher death rate.
Disturbance robustness evaluation (Oscillator)
LAC original results
LAC4 results
Seems to give the same results as the original LAC.
SAC original results
Meeting notes 17-04-2021
- We were able to improve the LAC robustness by only using the Lyapunov Value that came from the best action given the current policy.
- We found out that the `alpha3*R` term can be dropped and a simple `alpha3` term can be used. This results in a softer version of Lyapunov stability (the derivative is less negative), but this version can be used to make any cost function stable in the sense of Lyapunov (more practical).
- We further found the following problems in which we might possibly test the new LAC algorithm in the future:
  - Mark-time Humanoid: Like a soldier marching in place. Can also include upper body movements or frequency requirements.
  - Cheetah: Hopping in place + frequencies.
  - Bicycle: Maybe in the future we can use this or this environment.
  - Drone: We leave it for now, but maybe later we can use Flightmare.
  - Car tracking: We can do the steering manoeuvre test with this simulator.
  - Cubli walking: Gyroscopic cube https://www.youtube.com/watch?v=n_6p-1J551Y.
  - Full cheetah: Like Boston Dynamics.
Evaluate LAC robustness
@panweihit Let's evaluate the new LAC4 and compare it with SAC for multiple environments, but now let it train for `1e6` steps:
- CartPoleCost
- Oscillator-v1
Oscillator-v1
LAC
Good performance; it looks better than SAC but worse than LAC4.
LAC4
Performance and robustness look better than LAC (could still be seeding). It also looks better than SAC.
SAC
Performance and robustness look worse than both LAC versions.
Conclusion
- The new LAC4 algorithm, in which the Lyapunov constraint is placed in the critic cost function and only the minimum Lyapunov value is used in the Lyapunov constraint, works.
- It looks like both the performance and robustness are improved compared to LAC. More seeds are needed to be sure.
- For performance (and robustness), training an agent for `3e5` steps looks to be enough in the Oscillator-v1 environment.
- LAC performance is similar to SAC, but it has higher robustness.
Meeting notes (18-04-2021)
@dds0117 I had a meeting with @panweihit yesterday to discuss the results of the tests above and the continuation of our research. Below you will find the notes of the meeting.
Results discussion
- The new LAC algorithm seems to work as well as (maybe even better than) the old LAC and SAC algorithms. There are several things we can still investigate:
  - We changed the Lyapunov constraint from

    ```python
    l_delta = torch.mean(lya_l_ - l1.detach() + self._alpha3 * r)  # See Han eq. 11
    ```

    to

    ```python
    l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r)  # See Han eq. 11
    ```

    In this new version, only the minimum Lyapunov value of the current best action is used to check if the constraint is violated (a sketch of how `l_delta` feeds the Lagrange-multiplier update follows after this list).
  - We also replaced the `alpha_3 * r` term in the equation with a small `alpha_3` term (0.0001). This was done to check if this term is vital. From the results above, we can see that the algorithm achieves the same performance even without it. The `alpha_3*r` term makes sure that the algorithm is stable in mean cost; it incorporates extra information about the system, which we try to exploit by making our Lyapunov stability definition stricter. As the algorithm is also robust without this information, dropping it increases the practical relevance of our algorithm, since such information might not be available for all systems, or the problem might be too hard when using this stricter Lyapunov stability. As researchers, we can use any of the Lyapunov stability criteria, (strict) asymptotic stability, exponential stability, (strict) asymptotic stability in mean cost, etc., for our algorithm.
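For context, here is a hedged sketch of how a constraint value like `l_delta` typically feeds the Lagrange-multiplier update in such a Lagrangian-relaxation setup (this also relates to the LAC6 idea of updating the multiplier before the actor/critic updates). The variable names (`log_labda`) and the learning rate are assumptions and may differ from the actual code.

```python
import torch

# Minimal sketch: the multiplier is kept positive by optimizing its log.
log_labda = torch.zeros(1, requires_grad=True)
labda_optimizer = torch.optim.Adam([log_labda], lr=3e-4)


def update_lagrange_multiplier(l_delta):
    """Increase labda when the Lyapunov constraint (l_delta <= 0) is violated,
    and decrease it when the constraint is satisfied."""
    labda_loss = -(log_labda.exp() * l_delta.detach()).mean()
    labda_optimizer.zero_grad()
    labda_loss.backward()
    labda_optimizer.step()
    return log_labda.exp().detach()
```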
Other discussion points
@panweihit pointed me to a very insightful MIT course given by Dr. Russ Tedrake. This course explains that, as long as your reward is Lyapunov stable (has a decreasing derivative), the system also learns stable and robust behaviour. I haven't watched the full lecture yet, so I will update the explanation below later. But here is my current understanding:
This conclusion implies that we don't need to design very complicated stability measures for our robot tasks. A reward that makes sure that the robot doesn't fall is good enough to ensure stability and robustness. Let's take Boston Dynamics' Spot as an example. In this case, we don't need a cost function that exploits complicated theoretical stability measures, like the Zero-Moment Point or the COM being vertically inside the convex hull of its contact points, to achieve stable behaviour. According to Dr. Russ Tedrake, using such knowledge is merely a bonus. A simpler cost function, like the perpendicular distance between the robot COM and the reference path, already implicitly encodes the stability. If the robot cannot track this path, it has fallen, so it is learning stable behaviour when our Lyapunov values are always decreasing. This greatly increases how practical our algorithm is, since we can now use it to learn stable/robust behaviour even when theoretical knowledge about the system's stability is not available. For systems where we have such knowledge, we can use it to get an additional bonus.
What do we need to do now
Currently, I'm finishing several experiments to:
- Solve why the CartPole cart is not converging to zero.
- Check whether the changes we made to the LAC algorithm, discussed above, really achieve better stability/robustness.
- Run 3 random seeds to see if LAC4 is better than LAC.
- Check the required number of steps to train an agent in the Oscillator and CartPole environments.
  - I train the agents for `1e6` steps and look at where the agent's performance and robustness stagnate.
- Check if increasing the episode length of training to 800 instead of 400 improves the LAC performance and robustness in the Oscillator environment.
- Check whether changing the gamma improves the performance and robustness.
I am further adding a value network to the LAC algorithm so that we can replace it with a Gaussian process. Replacing it with a Gaussian process makes sense since this allows some stochasticity in the value function, making it easier for the agent to learn stable behaviour. The reasoning is similar in nature to why SAC uses a Gaussian actor instead of a deterministic one; here, we use a stochastic value function instead of a deterministic one. We use a Gaussian process instead of a Gaussian network since the value function is convex in nature. @panweihit and I agreed that, because of this nature, a Gaussian process would be well able to capture the behaviour while keeping the algorithm simple. Your Gaussian process will replace the value network of the new LAC algorithm (I will create this algorithm based on the second version of SAC).
The next steps for creating the GPL algorithm, therefore, are as follows:
- Add value network to LAC.
- Replace it with a Gaussian process (see the sketch below this list).
- Perform experiments.
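To make the "replace it with a Gaussian process" step more concrete, below is a minimal sketch of a GP value-function approximator fitted on Monte-Carlo returns. It assumes scikit-learn's `GaussianProcessRegressor`; the class name `GPValueFunction` and its interface are hypothetical and not part of the repository. Whether the GP should be fit on Monte-Carlo returns or bootstrapped targets is exactly the open question discussed below.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


class GPValueFunction:
    """Hypothetical GP value function: maps states to a value estimate
    with predictive uncertainty (the stochasticity mentioned above)."""

    def __init__(self):
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
        self.gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

    def fit(self, states, returns):
        # states: (N, obs_dim) array, returns: (N,) Monte-Carlo returns.
        self.gp.fit(states, returns)

    def predict(self, states):
        # Returns the value estimate and its standard deviation per state.
        mean, std = self.gp.predict(states, return_std=True)
        return mean, std


# Usage sketch: fit on a batch of visited states and their observed returns.
# states = np.random.randn(256, 4); returns = np.random.randn(256)
# v = GPValueFunction(); v.fit(states, returns); mean, std = v.predict(states[:10])
```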
LAC4 Improvements
Take min Lyapunov target value
@panweihit slightly modified the Lyapunov constraint such that the minimum Lyapunov value is now used in the Lyapunov constraint:
```python
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r)  # See Han eq. 11
```
Remove stricter Lyapunov stability
We removed the `alpha_3*r` term from the Lyapunov constraint:
```python
self._alpha3 = 0.000001  # Small quadratic regulator term to ensure negative definiteness. Without it, the derivative can be negative semi-definite.
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3)  # See Han eq. 11
```
The LAC algorithm trains fine without this.
Yes, I agree with you. The Gaussian process value function is finished, but I ran into a problem with using the GP value function directly in place of the value network. Because the Gaussian process depends on the temporal sequence seen during training, it would have to use a Monte-Carlo update instead of a temporal-difference (TD) update. I am stuck on this, so let's talk about it tomorrow.
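To illustrate the distinction mentioned above, here is a small, hedged sketch of the two target types (plain Python, illustrative names only): a TD target bootstraps from the current value estimate after every step, while a Monte-Carlo target needs the full remaining trajectory, which is why a GP fitted on whole episodes would push us towards Monte-Carlo updates.

```python
def td_target(reward, next_value, gamma=0.99, done=False):
    """One-step temporal-difference target: bootstraps from the current value
    estimate, so it can be computed online after every transition."""
    return reward + gamma * (0.0 if done else next_value)


def monte_carlo_returns(rewards, gamma=0.99):
    """Full Monte-Carlo returns: the whole episode must be finished before the
    update, which is what fitting a GP on complete trajectories would require."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))


# Example: an episode with rewards [1, 0, 2] gives returns
# [1 + 0.99 * 0 + 0.99**2 * 2, 0 + 0.99 * 2, 2].
# print(monte_carlo_returns([1.0, 0.0, 2.0]))
```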
@panweihit, @dds0117 Here is the new model that was trained for the robustness eval of the cart_pole.
Robustness eval Instructions
See also https://rickstaa.github.io/bayesian-learning-control/control/eval_robustness.html.
- Create a conda environment.
- Activate the conda environment.
- Install the packages: `pip install -e .`
- Put the model inside the data folder.
- Run the following command:

  ```bash
  python -m bayesian_learning_control.run eval_robustness ~/Development/work/bayesian-learning-control/data/lac4_cart_pole_cost/lac4_cart_pole_cost_s1250 --disturbance_type=input
  ```
- See the results
Change the disturbance
To change the disturbance, change the magnitude inside the `DISTURBER_CFG` variable in the https://github.com/rickstaa/simzoo/blob/c0f32230f68b7f0353412a848d8b8598cd82d21c/simzoo/common/disturber.py#L61 file.
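As a purely hypothetical illustration of the kind of edit meant here (the real structure of `DISTURBER_CFG` is defined in the linked `disturber.py` and may differ), increasing an input-disturbance magnitude could look roughly like this:

```python
# Hypothetical excerpt; check the linked disturber.py for the actual keys and structure.
DISTURBER_CFG = {
    "input_disturbance": {
        "magnitude": 5.0,  # increase or decrease to change the disturbance strength
    },
}
```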
Discussion 11-06-2021
@panweihit, @dds0117 For future reference, here is a small summary of what we found out in our experimentation yesterday:
- The performance and robustness of LAC and LAC4 look similar. The performance of SAC is similar, but its robustness is lower.
- The robustness is very much dependent on the actor and critic network structure.
  - When we used a linear (affine) network structure (i.e. [1] or [16]) for the actor, the agent was not able to find any rewarding behaviour.
As we discussed, I think the main takeaway is that, when we implement the Gaussian version of the LAC algorithm, it should work as long as the function approximator, a (deep) Gaussian process, is expressive enough to capture the complexity of the system.
Closed since there are more important things to do first.