First LAC/SAC training results
Closed this issue · 0 comments
This issue contains the first results I achieved while trying to reproduce the results of Han et al. 2020.
Oscillator
Convergence
lac_oscillator_exp.zip
sac_oscillator_exp.zip
It looks like both algorithms train, but SAC converges faster. Since LAC has to adhere to an additional stability constraint, it needs more time to converge. This differs from what Minghao showed in his paper.
Stabilisation
My results
Article
Possible reasons for the difference
1. $\alpha_3$: I used $\alpha_3 = 0.1$ during training. The exact value used by Han et al. is unclear, since the article states $1.0$ while the code uses $0.1$–$0.2$. The higher the value, the harder the optimization problem.
2. $\alpha$: In his experiments, Han et al. used a starting $\alpha$ of $1.0$ for SAC and $2.0$ for LAC. ❌ This causes LAC to explore more.
3. Finite horizon: From Table S1 of the article, it looks like the finite-horizon version was used. This could speed up learning, since the agent overfits more to a given signal. It would also mean that the current hyperparameters are optimized for the finite-horizon case, not the infinite-horizon case.
4. Actor architecture: The code uses $[64, 64]$, but the article states $[256, 256]$ (see https://github.com/rickstaa/Actor-critic-with-stability-guarantee/blob/8a90574fae550e98a9b628bbead6da7f91a51fff/variant.py#L122). I decided to use $[256, 256]$. Since both SAC and LAC are scaled down in Han's case, this will likely cause a different effect than the one we see.
5. Learning rate decrease: I decrease the learning rate per epoch, whereas Han et al. do it per step.
6. Train per cycle: In his experiments, Han set the number of training updates per epoch to $50$ for SAC and $80$ for LAC. ❌ This means LAC received more training updates per epoch.
7. Seeds: Some of my LAC seeds take a long time to train. Han et al. did not store their seeds, and their computer architecture was different, so I can only partially reproduce their results. I do, however, average the results over ten policies.
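To see why a larger $\alpha_3$ makes the optimization harder (point 1), here is a minimal sketch of the Lyapunov decrease condition as I read it from Han et al.; the exact form used in their code may differ, and the function name and values below are purely illustrative:

```python
def lyapunov_constraint(l_now, l_next, alpha3):
    """LAC-style Lyapunov decrease condition (illustrative form).

    The policy is considered feasible when
        L(s') - L(s) + alpha3 * L(s) <= 0,
    i.e. the Lyapunov candidate must shrink by at least a factor
    alpha3 each step. A larger alpha3 demands a faster decrease,
    which tightens the constraint on the actor update.
    """
    return l_next - l_now + alpha3 * l_now


# Example: a Lyapunov value that decays by 15% in one step.
l_now, l_next = 1.0, 0.85
print(lyapunov_constraint(l_now, l_next, alpha3=0.1))  # negative -> satisfied
print(lyapunov_constraint(l_now, l_next, alpha3=1.0))  # positive -> violated
```

With $\alpha_3 = 0.1$ a 15% per-step decrease easily satisfies the condition, while with $\alpha_3 = 1.0$ the same policy violates it, which matches the intuition that the article's value of $1.0$ would make training considerably harder.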
Points 2 and 6 in particular are design flaws in Han's experiments.
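Regarding point 5, the decay cadence matters a lot: a per-step schedule applies the decay factor hundreds of times more often than a per-epoch one, so the effective learning rate at the end of training can differ by many orders of magnitude. A toy sketch (the decay factor and counts below are made up, not the actual hyperparameters):

```python
def decayed_lr(lr0, gamma, n_updates):
    """Exponential decay applied n_updates times."""
    return lr0 * gamma ** n_updates


lr0, gamma = 1e-4, 0.999
steps_per_epoch, epochs = 500, 100

# Per-step decay (Han et al.): one decay per environment step.
per_step = decayed_lr(lr0, gamma, steps_per_epoch * epochs)

# Per-epoch decay (my implementation): one decay per epoch.
per_epoch = decayed_lr(lr0, gamma, epochs)

print(per_step, per_epoch)  # per-step ends far smaller than per-epoch
```

This is only meant to show that the two schedules are not interchangeable; which one converges better for LAC/SAC is an empirical question.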
Results when setting points 2 and 6 equal to Han's experiment.
lac_oscillator_han_2020_exp.zip
sac_oscillator_han_2020_exp.zip
The results do not change significantly; the performance difference between LAC and SAC remains small.
OscillatorComp
Convergence
lac_oscillator_complicated_exp.zip
sac_oscillator_complicated_exp.zip
As with the regular Oscillator environment, my results differ slightly from the article.
Stabilisation
My results
Article
CartPoleCost
Convergence
Short training
lac_cartpole_cost_exp.zip
sac_cartpole_cost_exp.zip
Longer training
lac_cartpole_cost_long_exp.zip
sac_cartpole_cost_long_exp.zip
In both cases, the results match the ones in the article.
Stabilisation
My results
After a short training
After a longer training
Article
Performance
LAC has better performance (a lower episode return, which represents cost in this environment).
LAC (long)
-------------------------------------
| AverageEpRet | 8.2 |
| StdEpRet | 7.11 |
| MaxEpRet | 29.5 |
| MinEpRet | 0.229 |
| AverageEpLen | 250 |
-------------------------------------
SAC (long)
-------------------------------------
| AverageEpRet | 14 |
| StdEpRet | 8.86 |
| MaxEpRet | 62 |
| MinEpRet | 1.02 |
| AverageEpLen | 250 |
-------------------------------------
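For reference, the diagnostics in the tables above can be recomputed from a batch of evaluation rollouts with a few lines of NumPy; the episode returns below are made-up placeholders, not my actual rollouts:

```python
import numpy as np


def episode_diagnostics(ep_returns, ep_lens):
    """Summarize evaluation rollouts the same way the tables above do."""
    r = np.asarray(ep_returns, dtype=float)
    return {
        "AverageEpRet": float(r.mean()),
        "StdEpRet": float(r.std()),
        "MaxEpRet": float(r.max()),
        "MinEpRet": float(r.min()),
        "AverageEpLen": float(np.mean(ep_lens)),
    }


# Placeholder returns for ten evaluation episodes of 250 steps each.
stats = episode_diagnostics(
    [8.0, 6.5, 12.3, 7.1, 9.9, 5.2, 10.4, 8.8, 7.6, 6.0],
    [250] * 10,
)
for key, value in stats.items():
    print(f"| {key:<13} | {value:8.3g} |")
```

Note that averaging over ten seeds (point 7 above) is exactly this kind of aggregation, so the `StdEpRet` row also gives a rough sense of seed-to-seed variability.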
Swimmer
Convergence
lac_swimmer_cost_exp.zip
sac_swimmer_cost_exp.zip
In my results, the difference between LAC and SAC is more prominent.
My results
Article
Performance
Looking at the policies of both SAC and LAC, SAC does not achieve good performance. It exhibits strange, suboptimal locomotion: it uses a lot of force at the beginning and then slides along without doing anything.
LAC
-------------------------------------
| AverageEpRet | 146 |
| StdEpRet | 2.74 |
| MaxEpRet | 155 |
| MinEpRet | 141 |
| AverageEpLen | 250 |
-------------------------------------
SAC
-------------------------------------
| AverageEpRet | 185 |
| StdEpRet | 1.64 |
| MaxEpRet | 188 |
| MinEpRet | 182 |
| AverageEpLen | 250 |
-------------------------------------