In this application, you will learn how to use OpenAI Gym to create a controller for the classic pole-balancing problem. The problem will be solved using reinforcement learning. While this topic deserves a much more thorough treatment, here we present a simple formulation of the problem that can be solved efficiently with gradient descent. Also note that pole balancing can be solved by classic control theory, e.g., through a PID controller; however, the solution presented here does not require linearization of the system.
To download the [code][polecode], you can either click on the green button to download as a zip, or use [git][giturl]. Please see a git tutorial [here][gitdoc].
1. You will first need to install [Python 3.5][pythonurl]. Check that Python is correctly installed by typing `python` in the command line.
2. Install TensorFlow (CPU version).
3. Once done, go to the folder where you keep this code and type `pip install -r requirement.txt`. This should install all dependencies.
4. You can then run the code by typing `python main.py`.
The simple pole balancing (inverted pendulum) setting consists of a pole and a cart. The system states are the cart displacement $x$, the cart velocity $\dot{x}$, the pole angle $\theta$, and the pole angular velocity $\dot{\theta}$. The default problem parameters are listed below:
Parameter | Value |
---|---|
mass cart | 1.0 |
mass pole | 0.1 |
pole length | 0.5 |
force | 10.0 |
delta_t | 0.02 (seconds) |
theta_threshold | 12 (degrees) |
x_threshold | 2.4 |
The system equations (for the frictionless case, as implemented in Gym) are

$$ \ddot{\theta} = \frac{g\sin\theta + \cos\theta\left(\frac{-F - m_p l \dot{\theta}^2\sin\theta}{m_c + m_p}\right)}{l\left(\frac{4}{3} - \frac{m_p\cos^2\theta}{m_c + m_p}\right)}, \qquad \ddot{x} = \frac{F + m_p l\left(\dot{\theta}^2\sin\theta - \ddot{\theta}\cos\theta\right)}{m_c + m_p}, $$

where $m_c$ is the cart mass, $m_p$ the pole mass, $l$ the half-pole length, $F$ the applied force, and $g$ the gravitational acceleration.
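As a rough sketch of how these equations advance the state, the following one-step Euler update uses the parameter values from the table above (the gravity value 9.8 and the function name are assumptions for illustration, in the spirit of Gym's cartpole.py):

```python
import math

# Assumed parameters, matching the table above
GRAVITY, MASS_CART, MASS_POLE = 9.8, 1.0, 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
LENGTH = 0.5                       # half-pole length
POLEMASS_LENGTH = MASS_POLE * LENGTH
FORCE_MAG, DELTA_T = 10.0, 0.02

def step(x, x_dot, theta, theta_dot, action):
    """Advance the cart-pole state by one Euler step; action is 0 (push left) or 1 (push right)."""
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLEMASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    # Euler integration with step size DELTA_T
    return (x + DELTA_T * x_dot, x_dot + DELTA_T * x_acc,
            theta + DELTA_T * theta_dot, theta_dot + DELTA_T * theta_acc)
```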
The controller takes in the system states and outputs a fixed-magnitude force on the cart, pushing it either left or right. The controller needs to be designed so that within 4 seconds the pole angle does not exceed 12 degrees and the cart displacement does not exceed 2.4 unit lengths. A trial is terminated when the system has been balanced for more than 4 seconds or when any of the constraints is violated.
Here we learn a controller in a model-free fashion, i.e., the controller is learned without knowledge of the underlying dynamical system. We first introduce the concept of a Markov Decision Process (MDP): an MDP contains a state space, an action space, a transition function, a reward function, and a decay parameter $\gamma \in (0,1)$ that discounts rewards received further in the future.
The goal of optimal control is thus to find a controller (a policy that chooses an action based on the current state) that maximizes the expected discounted cumulative reward.
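For completeness, one standard way to write this objective (the notation here is introduced for illustration and is not taken verbatim from the original text) is

$$ \pi^* = \arg\max_{\pi}\; \mathbb{E}\left[\left.\sum_{k=0}^{K} \gamma^{k} r_k \,\right|\, \pi\right], $$

where $r_k$ is the reward received at step $k$ of a trial with $K$ steps and $\gamma$ is the decay parameter introduced above.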
When the transition function is not known to the controller, one can use model-free methods such as Q-learning or, as done here, a policy-gradient approach to solve the optimization problem. I will skip the details and go directly to the solution. Given a trial with $K$ steps, in which the controller outputs probabilities $\pi_k$, takes actions $a_k \in \{0,1\}$, and receives rewards $r_k$, the loss on the controller parameters $w$ is

$$ f(w) = -\sum_{k=1}^K \left(\sum_{j=k}^K \gamma^{j-k}r_j\right) \big(a_k\log(\pi_k)+(1-a_k)\log(1-\pi_k)\big). $$
Essentially, this loss is small (equivalently, the underlying objective is maximized) when the control decisions that lead to high value are assigned high probability. Thus by tuning the parameters $w$ to minimize $f(w)$, the controller learns to prefer actions with high discounted future reward.
In the code, you may notice that the discounted rewards are normalized before they enter the loss; this standardization reduces the variance of the gradient estimate and tends to stabilize training.
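A minimal sketch of this computation is shown below; the function name, the default gamma, and the optional normalization step are illustrative assumptions and do not necessarily match what main.py does:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99, normalize=True):
    """Compute sum_{j>=k} gamma^(j-k) * r_j for every step k of one trial."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each step reuses the partial sum of the steps after it
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        returns[k] = running
    if normalize:
        # Optional: standardize to reduce the variance of the gradient estimate
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns

# Example: constant reward of 1 for a 5-step trial
print(discounted_returns([1.0] * 5, gamma=0.9, normalize=False))
```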
The controller is modeled as a single-layer neural network: the four system states pass through a linear layer followed by a sigmoid, which produces the probability $\pi$ of pushing the cart in one of the two directions. It is found that a single layer is already sufficient for this environment; if needed, you can replace the network with a more complex one.
Due to the probabilistic nature of the value function, we minimize the loss averaged over a batch of trials rather than the loss of a single trial.
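The sketch below shows what such a single-layer controller and return-weighted loss might look like in TensorFlow 1.x style (the era this repo targets); the placeholder and variable names are illustrative and do not necessarily match main.py:

```python
import tensorflow as tf

# Inputs: the visited states, the actions taken, and the discounted returns
states = tf.placeholder(tf.float32, [None, 4], name="states")
actions = tf.placeholder(tf.float32, [None, 1], name="actions")
returns = tf.placeholder(tf.float32, [None, 1], name="returns")

# Single-layer controller: linear map + sigmoid gives the probability of pushing right
W = tf.Variable(tf.random_normal([4, 1], stddev=0.1))
b = tf.Variable(tf.zeros([1]))
pi = tf.sigmoid(tf.matmul(states, W) + b)

# Return-weighted log-likelihood of the chosen actions, as in f(w) above,
# averaged over all recorded steps of the batch of trials
log_lik = actions * tf.log(pi + 1e-8) + (1.0 - actions) * tf.log(1.0 - pi + 1e-8)
loss = -tf.reduce_mean(returns * log_lik)
train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)
```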
With the default problem settings, we can get a convergence curve similar to this:
The y-axis of the main plot is the total reward a policy achieves, and the x-axis is the number of training epochs. The small inset window shows the normalized trajectories of the cart position and pole angle in the most recent trial. It can be seen that the learning procedure arrives at a good controller in the end.
To store videos, you will need to uncomment the line:
# self.env = wrappers.Monitor(self.env, dir, force=True, video_callable=self.video_callable)
By doing this, a series of simulation videos will be saved in the folder /tmp/trial.
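If you want to control which episodes are recorded, Gym's Monitor wrapper accepts a video_callable function that takes the episode index and returns whether to record that episode. A minimal sketch (the recording interval of 10 is an arbitrary choice for illustration):

```python
from gym import wrappers

# Record every 10th episode instead of the Monitor default
def video_callable(episode_id):
    return episode_id % 10 == 0

# env = wrappers.Monitor(env, "/tmp/trial", force=True, video_callable=video_callable)
```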
You can change problem parameters in gym_installation_dir/envs/classic_control/cartpole.py.
More details about the setup of this physical environment can be found in the Gym documentation.
Details on how to derive the governing equations for a single pole can be found in this technical report.
Corresponding equations for generalizing the derivation to multiple poles can be found in this paper.
- Move all code to ipynb
- Add more intro to RL