
Using RL algorithms (REINFORCE and an evolutionary algorithm) to play CartPole in the OpenAI Gym


Cartpole

REINFORCE is a simple RL algorithm (perhaps too simple) that uses stochastic policy gradients to learn which action is most likely to yield the maximum reward in a particular state.

A policy gradient method is a reinforcement learning approach that learns a policy directly: a parameterized function (e.g., a neural network) serves as the policy, and it is trained to increase the probability of actions in proportion to the observed rewards.

REINFORCE is the simplest implementation of a policy gradient method: it maximizes the probability of an action times the observed reward after taking that action, so each action's probability (given a state) is adjusted according to the size of the observed reward. REINFORCE is an effective and very simple way of training a policy function, but it is a little too simple. For CartPole it works very well, since the state space is small and there are only two actions. In an environment with many more possible actions, reinforcing all of them each episode and hoping that, on average, only the good actions get reinforced becomes less and less reliable.
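To make this concrete, here is a minimal REINFORCE sketch in PyTorch. It is illustrative, not the notebook's exact code: the network architecture, learning rate, and discount factor are assumptions, and the classic Gym reset/step API is assumed.

```python
import gym
import torch
import torch.nn as nn

# Minimal policy network for CartPole: 4 state inputs -> 2 action probabilities.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
env = gym.make("CartPole-v1")

for episode in range(500):
    state = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                       # sample an action from the policy
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Discounted returns G_t, computed backwards through the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + 0.99 * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize for stability

    # REINFORCE: increase the log-probability of each action in proportion to its return.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```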

Training examples:

Initial Stages

(GIF: episodes during the early stages of training)

Here the pole keeps falling off quite quickly while training is in its early stages.

After training

(GIF: an episode after training)

Now the cart is able to travel much further without the pole falling. This duration can be increased by making the network a bit deeper (a very simple sequential network is used here) and by raising the maximum number of steps per episode from 200 to a greater number. To do this, edit the MAX_DUR variable.
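Continuing the sketch above, the episode cap might look like this. MAX_DUR is the variable named in the repo; the loop body itself is illustrative and reuses env and policy from the earlier snippet.

```python
MAX_DUR = 200  # maximum steps per episode; raise this to let trained episodes run longer

state = env.reset()
for t in range(MAX_DUR):  # episode ends early if the pole falls, otherwise at MAX_DUR steps
    probs = policy(torch.as_tensor(state, dtype=torch.float32))
    action = torch.distributions.Categorical(probs).sample()
    state, reward, done, _ = env.step(action.item())
    if done:
        break
```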

  • Evolutionary algorithms provide us with more powerful tools for our toolkit. Based on biological evolution, we (see the genetic-algorithm sketch after this list):

    • Produce individuals
    • Select the best from the current generation
    • Shuffle the genes around
    • Mutate them to introduce some variation
    • Mate them to create new generations for the next population
  • Evolutionary algorithms tend to be more data-hungry (less sample-efficient) than gradient-based approaches; in some circumstances this may be fine, notably if you have a fast simulator.

  • Evolutionary algorithms can optimize over nondifferentiable or even discrete functions, which gradient-based methods cannot do.

  • Evolutionary strategies (ES) are a subclass of evolutionary algorithms that do not involve biological-style mating and recombination; instead, they create new individuals from a population by copying with noise and taking weighted sums (a minimal sketch follows below).
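The listed steps map directly onto code. Below is a minimal genetic-algorithm sketch, not the repo's implementation: it evolves the weights of a simple linear CartPole policy, and the population size, elite count, mutation scale, and linear policy are all illustrative assumptions (classic Gym API assumed).

```python
import numpy as np
import gym

env = gym.make("CartPole-v1")

def evaluate(weights, episodes=3):
    """Fitness = average episode reward of a linear policy: action = argmax(W @ state)."""
    W = weights.reshape(2, 4)
    total = 0.0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = int(np.argmax(W @ state))
            state, reward, done, _ = env.step(action)
            total += reward
    return total / episodes

pop_size, n_genes = 50, 8
population = np.random.randn(pop_size, n_genes)        # produce individuals

for generation in range(20):
    fitness = np.array([evaluate(ind) for ind in population])
    elite = population[np.argsort(fitness)[-10:]]      # select the best of this generation
    children = []
    while len(children) < pop_size:
        a, b = elite[np.random.randint(10, size=2)]    # mate two elite parents
        mask = np.random.rand(n_genes) < 0.5           # shuffle genes (uniform crossover)
        child = np.where(mask, a, b)
        child += 0.1 * np.random.randn(n_genes)        # mutate to introduce variation
        children.append(child)
    population = np.array(children)                    # the next generation
    print(f"gen {generation}: best fitness {fitness.max():.1f}")
```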
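For contrast, here is an evolution-strategy sketch of the same problem: instead of crossover, each generation copies a single parent with Gaussian noise and moves the parent along a reward-weighted sum of the perturbations. It reuses the illustrative evaluate() fitness function from the sketch above; sigma and alpha are assumed hyperparameters.

```python
import numpy as np

theta = np.random.randn(8)            # single parent parameter vector (same linear policy)
sigma, alpha, pop_size = 0.1, 0.02, 50

for generation in range(20):
    noise = np.random.randn(pop_size, 8)              # copy the parent with noise
    rewards = np.array([evaluate(theta + sigma * eps) for eps in noise])
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Weighted sum of perturbations: move theta toward higher-reward directions.
    theta += alpha / (pop_size * sigma) * noise.T @ advantage
```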