rickstaa/stable-learning-control

First LAC/SAC training results


This issue contains the first results I achieved while trying to reproduce the results of Han et al. 2020.

Oscillator

Convergence

lac_oscillator_exp.zip
sac_oscillator_exp.zip

It looks like both algorithms train, but SAC converges faster. Since LAC has to adhere to an additional stability constraint, it needs more time to converge. This differs from what Minghao showed in his paper.

Stabilisation

My results

image

Article

image

Possible reasons for the difference

  1. $\alpha_3$: I used $\alpha_3=0.1$ for training. The exact value used by Han et al. is unclear, since the article states $1.0$ while the code uses $0.1$-$0.2$. The higher the value, the harder the optimization problem becomes (see the constraint sketch after this list).
  2. $\alpha$: In their experiments, Han et al. used a starting $\alpha$ of $1.0$ for SAC and $2.0$ for LAC. ❌
    • This causes LAC to explore more.
  3. Finite horizon: From Table S1 of the article, it looks like the finite-horizon version was used.
    • This could speed up learning, since the agent overfits more to a given signal. It would also mean that the current hyperparameters are optimized for the finite-horizon case, not the infinite-horizon case.
  4. Actor architecture: The code uses $[64, 64]$, but the article states $[256, 256]$ (see https://github.com/rickstaa/Actor-critic-with-stability-guarantee/blob/8a90574fae550e98a9b628bbead6da7f91a51fff/variant.py#L122). I decided to use $[256, 256]$.
    • Since both SAC and LAC are scaled down in Han's case, the effect there is likely different from what we see here.
  5. Learning rate decay: I decay the learning rate per epoch, whereas Han et al. decay it per step (see the learning-rate sketch after this list).
  6. Train per cycle: In his experiments, Han set the number of training updates per epoch to $50$ for SAC and $80$ for LAC. ❌
    • Meaning that LAC received more training updates per epoch.
  7. Seeds: Some of my LAC seeds take considerably more work to train than others. Han didn't store his seeds and his computer architecture was different, so I can only partially reproduce his results. I do, however, average the results over ten policies (see the averaging sketch after this list).
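
On point 1: a minimal sketch of how I understand $\alpha_3$ to enter the Lyapunov decrease constraint of Han et al. (2020). The function and argument names below are my own, not the repository's API, and the Lyapunov candidate is assumed to be the cost critic:

```python
import torch

def lyapunov_constraint(l_critic, policy, s, a, c, s_next, alpha3=0.1):
    """Empirical estimate of the Lyapunov decrease condition

        E[ L(s', a') - L(s, a) + alpha3 * c ] <= 0,   a' ~ pi(.|s')

    A larger alpha3 demands a faster decrease of the Lyapunov candidate per
    unit of cost, which tightens the constraint and makes the optimisation
    problem harder to satisfy.
    """
    l_now = l_critic(s, a)             # L(s, a)
    a_next, _ = policy(s_next)         # a' sampled from pi(.|s')
    l_next = l_critic(s_next, a_next)  # L(s', a')
    return (l_next - l_now + alpha3 * c).mean()

# In the actor update this term is weighted by a learned Lagrange multiplier
# and added to the entropy-regularised policy loss, e.g.:
#   actor_loss = lambda_ * lyapunov_constraint(...) + alpha * log_pi.mean()
```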
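
On point 5: the difference between the two decay schemes, sketched below with a linear schedule and placeholder numbers (the actual schedule and values in the two code bases may differ):

```python
def linear_decay(lr_start, lr_final, progress):
    """Linearly anneal the learning rate; `progress` runs from 0 to 1."""
    return lr_start + min(progress, 1.0) * (lr_final - lr_start)

lr_start, lr_final = 1e-4, 1e-9      # placeholder values
epochs, steps_per_epoch = 100, 2048  # placeholder values

# Per-epoch decay (what I do): the learning rate stays constant within an
# epoch and is only lowered at the epoch boundary.
lr_per_epoch = [linear_decay(lr_start, lr_final, ep / epochs)
                for ep in range(epochs)]

# Per-step decay (what Han et al. do): the learning rate keeps shrinking
# after every single step, so within an epoch it drops below what the
# per-epoch schedule would still be using.
total_steps = epochs * steps_per_epoch
lr_per_step = [linear_decay(lr_start, lr_final, step / total_steps)
               for step in range(total_steps)]
```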
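
On point 7: since the original seeds are unknown, every number and curve I report is an average over ten independently trained policies. A minimal sketch of that aggregation (the names are mine):

```python
import numpy as np

def aggregate_over_seeds(per_seed_ep_rets):
    """Combine evaluation results from several seeds.

    `per_seed_ep_rets` is a list that contains, for every seed, the array of
    episode returns obtained by evaluating that seed's final policy.
    """
    seed_means = np.array([np.mean(rets) for rets in per_seed_ep_rets])
    return {
        "mean_over_seeds": seed_means.mean(),  # the value that gets reported
        "std_over_seeds": seed_means.std(),    # spread caused by the seed
        "num_seeds": len(seed_means),
    }
```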

Points 2 and 7, in particular, are design flaws in Han's experiments.

Results when setting points 2 and 6 equal to Han's experiment.

lac_oscillator_han_2020_exp.zip
sac_oscillator_han_2020_exp.zip

image

The results do not change significantly, and the performance difference between the two algorithms remains small.

OscillatorComp

Convergence

lac_oscillator_complicated_exp.zip
sac_oscillator_complicated_exp.zip

As with the regular Oscillator environment, my results differ slightly from those in the article.

Stabilisation

My results

image

Article

image

CartPoleCost

Convergence

Short training

lac_cartpole_cost_exp.zip
sac_cartpole_cost_exp.zip

Longer training

lac_cartpole_cost_long_exp.zip
sac_cartpole_cost_long_exp.zip

In both cases, the results match those in the article.

Stabilisation

My results

After a short training run ($1e5$ instead of $1e6$) we get the following result:

image

After a longer training run ($1e6$) we get the following result:

image

Article

image

Performance

LAC performs better: its average episode return (the accumulated cost) is lower. A sketch of how these statistics are computed follows the tables below.

LAC (long)

lac_cartpole

-------------------------------------
|    AverageEpRet |             8.2 |
|        StdEpRet |            7.11 |
|        MaxEpRet |            29.5 |
|        MinEpRet |           0.229 |
|    AverageEpLen |             250 |
-------------------------------------

SAC (long)

sac_cartpole

-------------------------------------
|    AverageEpRet |              14 |
|        StdEpRet |            8.86 |
|        MaxEpRet |              62 |
|        MinEpRet |            1.02 |
|    AverageEpLen |             250 |
-------------------------------------
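
The tables above are the standard per-episode statistics over a batch of evaluation rollouts. A minimal sketch of how they can be computed, assuming an old-style Gym environment (4-tuple `step`) and a deterministic `policy(obs)` callable; this is not the repository's evaluation utility, just an illustration of what the numbers mean:

```python
import numpy as np

def evaluate(env, policy, num_episodes=10, max_ep_len=250):
    """Roll out a trained policy and compute the logged statistics."""
    ep_rets, ep_lens = [], []
    for _ in range(num_episodes):
        obs, ep_ret, ep_len, done = env.reset(), 0.0, 0, False
        while not (done or ep_len == max_ep_len):
            obs, rew, done, _ = env.step(policy(obs))
            ep_ret += rew
            ep_len += 1
        ep_rets.append(ep_ret)
        ep_lens.append(ep_len)
    return {
        "AverageEpRet": np.mean(ep_rets),
        "StdEpRet": np.std(ep_rets),
        "MaxEpRet": np.max(ep_rets),
        "MinEpRet": np.min(ep_rets),
        "AverageEpLen": np.mean(ep_lens),
    }
```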

Swimmer

Convergence

lac_swimmer_cost_exp.zip
sac_swimmer_cost_exp.zip

In my results, the difference between LAC and SAC is more prominent.

My results

image

Article

image

Performance

When inspecting the trained policies of both SAC and LAC, SAC does not achieve good performance. It exhibits strange, suboptimal locomotion: it uses a lot of force in the beginning and then slides along without doing anything.

LAC

swimmer_lac

-------------------------------------
|    AverageEpRet |             146 |
|        StdEpRet |            2.74 |
|        MaxEpRet |             155 |
|        MinEpRet |             141 |
|    AverageEpLen |             250 |
-------------------------------------

SAC

swimmer_sac

-------------------------------------
|    AverageEpRet |             185 |
|        StdEpRet |            1.64 |
|        MaxEpRet |             188 |
|        MinEpRet |             182 |
|    AverageEpLen |             250 |
-------------------------------------