CPO for discrete actions. The original loss is changed to something like a policy-gradient objective (the original TRPO loss is elegant in math but performs poorly here); see the sketch after the credits below. A continuous-action version may be uploaded in the future (if I feel like it). The environment's step function should also return a cost value, with more dangerous actions returning a higher cost, as in the sketch below.
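A minimal sketch of the expected interface, assuming an old-Gym-style step that additionally returns a per-step cost; the toy environment, its names, and the exact position of cost in the return tuple are illustrative assumptions, so check how this repo's training loop actually unpacks step():

```python
import numpy as np

class SafeGridEnv:
    """Toy discrete-action environment whose step() also returns a per-step cost."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        # Action 0 is a safe small step, action 1 is a risky large step.
        self.pos += 1 if action == 0 else 3
        obs = np.array([self.pos], dtype=np.float32)
        reward = float(self.pos)
        # More dangerous actions return a higher cost.
        cost = 0.0 if action == 0 else 1.0
        done = self.pos >= self.size
        return obs, reward, cost, done, {}
```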
https://arxiv.org/abs/1705.10528
Some functions are inherited from https://github.com/ajlangley/cpo-pytorch.
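As for the objective change mentioned at the top, here is a minimal sketch of a policy-gradient-style surrogate for a discrete (Categorical) policy next to the original importance-ratio surrogate; `policy_net`, `advantages`, and the function names are assumptions, not this repo's exact API:

```python
import torch
from torch.distributions import Categorical

def pg_surrogate_loss(policy_net, states, actions, advantages):
    """Advantage-weighted log-probability objective (to be maximized)."""
    logits = policy_net(states)                          # (batch, n_actions)
    log_probs = Categorical(logits=logits).log_prob(actions)
    return (log_probs * advantages).mean()

def trpo_surrogate_loss(policy_net, states, actions, advantages, old_log_probs):
    """Original importance-ratio surrogate, kept here for comparison."""
    logits = policy_net(states)
    log_probs = Categorical(logits=logits).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    return (ratio * advantages).mean()
```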
Tune self.max_kl: the smaller the value, the slower the training, but the more stable it is. If you run into stability problems, decrease this parameter.
self.max_J_c (the cost-constraint limit) can be increased a little if the constraint is too tight.
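For reference, a hypothetical illustration of where these two knobs sit; the stub class and the example values are placeholders, not this repo's actual API:

```python
# Hypothetical stub showing the two tunable knobs; not this repo's actual class.
class CPOAgentStub:
    def __init__(self, max_kl=1e-2, max_J_c=0.1):
        # Trust-region size: smaller max_kl -> slower but more stable training.
        self.max_kl = max_kl
        # Cost-constraint limit: can be raised a little if the constraint
        # turns out to be too tight for your task.
        self.max_J_c = max_J_c

# If training is unstable, shrink the trust region:
agent = CPOAgentStub(max_kl=5e-3, max_J_c=0.15)
```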