PKU-Alignment/Safe-Policy-Optimization

Question about logger value

lijie9527 opened this issue · 2 comments

                if done or time_out:
                    # append the finished episode's return, cost, and length to the
                    # deques, which also retain episodes from earlier epochs
                    rew_deque.append(ep_ret[idx])
                    cost_deque.append(ep_cost[idx])
                    len_deque.append(ep_len[idx])
                    # log running means taken over the whole deques
                    logger.store(
                        **{
                            "Metrics/EpRet": np.mean(rew_deque),
                            "Metrics/EpCost": np.mean(cost_deque),
                            "Metrics/EpLen": np.mean(len_deque),
                        }
                    )
                    ep_ret[idx] = 0.0
                    ep_cost[idx] = 0.0
                    ep_len[idx] = 0.0

I'm confused by this np.mean(cost_deque). Because the deque also retains episodes from earlier epochs, the EpCost values logged for different epochs become correlated, so ep_costs = logger.get_stats("Metrics/EpCost") no longer matches the Jc definition used in safe RL papers. Is the purpose to average the newly added episodes together with data from many previous episodes, so as to improve training stability and make the plotted training curve look smoother?
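To make the concern concrete, here is a minimal, self-contained sketch with made-up episode costs (assuming the deques are created with a fixed maxlen, as in the training loop above), contrasting the running mean that gets logged with the per-epoch average cost Jc:

    # Minimal illustration with made-up episode costs; assumes the deques were
    # created with a fixed maxlen, as in the training loop above.
    from collections import deque

    import numpy as np

    cost_deque = deque(maxlen=50)  # keeps episodes finished in earlier epochs

    epoch_1_costs = [30.0, 32.0, 28.0]  # hypothetical episode costs from epoch 1
    epoch_2_costs = [10.0, 12.0, 11.0]  # hypothetical episode costs from epoch 2

    for c in epoch_1_costs + epoch_2_costs:
        cost_deque.append(c)

    # What the logger would report after epoch 2: a running mean over the deque,
    # still pulled up by the epoch-1 episodes.
    logged_ep_cost = np.mean(cost_deque)   # 20.5

    # Jc as usually defined in safe RL papers for epoch 2 alone averages only
    # the episodes collected in that epoch.
    per_epoch_jc = np.mean(epoch_2_costs)  # 11.0

    print(logged_ep_cost, per_epoch_jc)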

Gaiejj commented

Yes. Lagrangian methods often lack stability and can oscillate drastically when EpCost changes sharply. Updating the Lagrange multiplier with the EpCost averaged across epochs enhances the stability of the algorithm and also makes the training curve smoother.
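For concreteness, a minimal sketch of this kind of dual update (not the repository's exact code; cost_limit, lambda_lr, and the multiplier initialization below are illustrative values): the multiplier reacts to the deque-averaged EpCost rather than the latest epoch's raw cost, which damps its oscillation.

    # Sketch of a Lagrange multiplier (dual) update driven by the smoothed
    # EpCost. Not the repository's exact code: cost_limit, lambda_lr, and the
    # starting multiplier are illustrative values.
    from collections import deque

    import numpy as np

    cost_deque = deque([30.0, 28.0, 32.0, 29.0], maxlen=50)

    cost_limit = 25.0   # safety threshold in the constrained objective
    lambda_lr = 0.05    # step size for the dual ascent on the multiplier
    lagrange_multiplier = 0.0

    # Using the mean over the deque instead of the last epoch's raw cost
    # smooths the signal the multiplier reacts to.
    smoothed_ep_cost = np.mean(cost_deque)               # 29.75
    lagrange_multiplier += lambda_lr * (smoothed_ep_cost - cost_limit)
    lagrange_multiplier = max(lagrange_multiplier, 0.0)  # keep it non-negative

    print(smoothed_ep_cost, lagrange_multiplier)         # ~29.75 and ~0.2375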

lijie9527 commented

Thanks for the quick reply, I get it.