Policy Gradients

Teach an agent how to play Lunar Lander with Policy Gradients and TensorFlow.

Policy-based methods avoid learning a value function and instead directly search for the agent's optimal policy. The simple cross-entropy method relies on playing a number of games with the current policy, selecting the elite games whose reward is better than the rest, and directly updating the policy based on the states and actions in those elite games. The policy gradients method is a bit more sophisticated.

Find the optimal policy parameters θ which maximize the return, i.e. the cumulative discounted reward:
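J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]

where \pi_\theta is the policy with parameters \theta, \gamma is the discount factor and r_t is the reward at step t.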

Gradient of the objective function with respect to the policy parameters θ:
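\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]

where G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k is the discounted return from step t onwards (the standard REINFORCE form of the policy gradient).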

The gradient can be estimated with Monte Carlo sampling:
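\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)}

where the outer sum runs over N sampled episodes.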

For the policy parameters update, take samples from a single episode only. The loss can then be written as follows:
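L(\theta) = -\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\, G_t

The minus sign turns maximizing the objective into minimizing a loss.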

Minimize this loss with gradient descent to find the optimal policy.

Do this for each action in the episode, as in the sketch below.
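A minimal sketch of one such training step in TensorFlow 2.x; the network architecture, layer sizes, learning rate and names (`policy`, `train_step`) are illustrative assumptions, not necessarily what this repository uses:

```python
import tensorflow as tf

n_states, n_actions = 8, 4  # LunarLander-v2 observation and action sizes

# Assumed architecture: a small MLP producing unnormalized action logits.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(n_states,)),
    tf.keras.layers.Dense(n_actions),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def train_step(states, actions, returns):
    """One gradient-descent step on L = -sum_t log pi(a_t|s_t) * G_t.

    states: [T, n_states] float32, actions: [T] int32, returns: [T] float32,
    all taken from a single episode.
    """
    with tf.GradientTape() as tape:
        log_probs = tf.nn.log_softmax(policy(states))
        # Select log pi(a_t | s_t) for the actions that were actually taken.
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        taken_log_probs = tf.gather_nd(log_probs, idx)
        loss = -tf.reduce_sum(taken_log_probs * returns)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return loss
```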

During training, sample actions with respect to the probability distribution returned by the current policy. Reward shaping is necessary, or at least very helpful. Moreover, the training data can be decorrelated by drawing training batches from many different episodes, but this is optional.
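A sketch of the action sampling and return computation; the discount factor γ = 0.99 and the return normalization (a common stabilization trick) are assumptions here, not necessarily this repository's exact choices:

```python
import numpy as np

def sample_action(logits):
    """Sample an action from the softmax distribution over the policy's logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^(k-t) * r_k for every step t of an episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Normalizing returns (an assumption, not necessarily used here) often
    # stabilizes training by keeping gradient magnitudes in a sane range.
    return (returns - returns.mean()) / (returns.std() + 1e-8)
```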