Jack-Car-Rental-using-Policy-Iteration

This program is to find optimal policy over Jack's cars rental problem using policy iteration. There
are 2 locations (e.g. A and B) with initially 20 cars in each side. Cars in both location are rented
and returned according to certain distribution. Every night, we can move 5 cars from A to B or from
B to A. Cars moved in overnight plus remaining cars (not rented) cannot surpass 20. The maximum
number of cars to be rented in the next day is cars moved in overnight plus remaining cars on this
day. The number of cars requested and returned at each location are Poisson random variables
with λ as expected value. Assume λ is 3 and 4 for rental requests at locations A and B, and 3 and 2
for returns. We can earn $10 for one car rental and cost $2 for moving one car overnight.
T The original description about Jack's cars rental problem can be found in here.

Implementation

Running the entire process, I first form matrices representing environment's dynamics, which is associated
with triplet (state, next_state, action). State, in this case, can have 441 choices (number of cars in
location A/B can be [0,20]-->21X21=441), and action may be +5~-5 (sign refers to moving cars from A to
B or conversely). The matrices formed are 11X21X21 transition matrix and reward matrix. For the upcoming
dynamic programming (DP), we look up the matrices just formed. DP in policy iteration is pretty fast if
we load precomuted dynamics matrices. It costs about 30 minutes in my computer to form matrices corresponding
to the above setting. My original intention is to form the matrices using parallel programming, which really
takes advantages in the formulation of this problem. Perhaps I will implement it using Tensorflow afterward.
Also, improvement of value function for every state is checked to make sure we get the improved policy in each
iteration.

Result

The following shows errors of value function in policy evaluation, the improved policy in that iteration
(the one shown in the figure is the optimal policy), and the error between improved policy and original
policy (error = 0 means we get the optimal policy).

row[0:20] --> number of cars in location A is 020
col[0:20] --> number of cars in location B is 020

Dependencies

numpy
scipy
argparse

How to run

If using the precomputed matrices associated with environmental dynamics, just type in your terminal

$ python main.py

If you want to formulate a different problem setting (e.g. with different number of cars or number
of moves), you can type

$ python main.py --to_form_all True --params your_params

It will form and save .npy file in the same directory, and next time with the same problem setting,
you can simply load that .npy file, which may save a lot of time. You can get basic documentation from typing

$ python main.py --help

There are several parameters able to be adjusted.
Also, you can modify default value in file parameters.py

Author