RL-Policy-Iteration-for-robot-agent

Implementation of the policy iteration algorithm to find the best trajectory for a robot to go from the start to the goal cell in a 2D grid


Policy Iteration Algorithm for Optimal Robot Navigation

Overview

Policy iteration is a dynamic-programming algorithm for computing the optimal policy of a Markov decision process (MDP). In this project, I applied policy iteration to solve a grid-world navigation problem for a 2D robot.

Policy Iteration Algorithm

Assume we have a policy 𝝅 : S → A that assigns an action to each state; action 𝝅(s) is chosen each time the system is in state s.

The policy iteration algorithm can be summarized in three main steps:

  1. Evaluate a given policy (e.g. a policy initialised arbitrarily for all states s ∊ S) by calculating the value function for all states s ∊ S under that policy:

     V^𝝅(s) = R(s, 𝝅(s)) + γ Σ_{s′} P(s′ | s, 𝝅(s)) V^𝝅(s′)

     i.e. the value function is the expected reward collected at the first step plus the expected discounted value at the next state.

  2. Improve the policy: for every state s ∊ S, find a better action by acting greedily with respect to the current value function:

     𝝅′(s) = argmax_a [ R(s, a) + γ Σ_{s′} P(s′ | s, a) V^𝝅(s′) ]

  3. Repeat steps 1 and 2 until the value function converges to the optimal value function (equivalently, until the policy stops changing); a Python sketch of this loop is given below.
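
The following is a minimal Python sketch of these three steps for a generic finite MDP, assuming the transition probabilities P[s, a, s′], expected rewards R[s, a], and discount factor γ are given as NumPy arrays (the function names and array layout here are illustrative, not necessarily those used in this repository's code):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Compute V^pi for a fixed policy by iterating the Bellman expectation backup.

    P[s, a, s'] : transition probability, R[s, a] : expected immediate reward.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            R[s, policy[s]] + gamma * P[s, policy[s]] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, R, V, gamma=0.9):
    """Greedy one-step look-ahead: pick argmax_a Q(s, a) for every state."""
    Q = R + gamma * np.einsum('sap,p->sa', P, V)   # Q[s, a]
    return np.argmax(Q, axis=1)

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)         # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        new_policy = policy_improvement(P, R, V, gamma)
        if np.array_equal(new_policy, policy):     # policy stable -> optimal
            return policy, V
        policy = new_policy
```

Policy evaluation is done here by sweeping the Bellman backup to a tolerance; solving the linear system (I − γP^𝝅)V = R^𝝅 directly is an equally valid choice.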

The 2D Robot Agent Problem

To run and evaluate the policy iteration algorithm, we assume a 2D robot agent problem.
Consider the following Markov decision process. The state space is a 10×10 grid; cells that are obstacles are marked in gray. The initial state of the robot is shown in blue and the desired terminal state in green. The robot receives a reward of 10 when it reaches the desired terminal state, and the discount factor is 0.9. From each non-obstacle cell, the robot can attempt to move to any of its immediate neighboring cells using one of four controls (North, East, West and South); it cannot move diagonally. The commanded move succeeds with probability 0.7, and with the remaining probability 0.3 the robot ends up at some other cell as follows:

P(moves north | control is north) = 0.7,
P(moves west | control is north) = 0.1,
P(moves east | control is north) = 0.1,
P(does not move | control is north) = 0.1.


Similarly, if the robot is commanded to go east, it may end up in the cells to its north or south, or stay put at the original cell, with total probability 0.3, and it actually moves to the cell to its east with probability 0.7. The robot pays a cost of 1 (i.e., receives a reward of -1) for each control input it takes, regardless of the outcome. If the robot ends up at a cell marked as an obstacle, it receives a reward of -10 for each time-step that it remains inside the obstacle cell. We would like to implement policy iteration to find the best trajectory for the robot to go from the blue cell to the green cell.
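
As an illustration of how this MDP can be encoded (the function names, obstacle handling, and reward bookkeeping below are assumptions made for the sketch, not the exact code of this repository), one way to build the transition and reward arrays is to fold the landing-cell bonus (+10 for the goal, -10 for obstacle cells) into the expected immediate reward together with the per-step cost of -1; making the goal state absorbing is a further modelling choice omitted here for brevity:

```python
import numpy as np

N = 10                                                    # 10x10 grid
ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]              # North, East, South, West as (row, col) offsets
SLIPS = {0: (3, 1), 1: (0, 2), 2: (1, 3), 3: (2, 0)}      # lateral drift directions for each control

def build_model(obstacles, goal, step_cost=-1.0, obstacle_reward=-10.0, goal_reward=10.0):
    """Build P[s, a, s'] and R[s, a] for the stochastic grid-world MDP."""
    n_states = N * N
    to_state = lambda r, c: r * N + c

    def move(r, c, a):
        nr, nc = r + ACTIONS[a][0], c + ACTIONS[a][1]
        # moving off the grid leaves the robot in place
        return (r, c) if not (0 <= nr < N and 0 <= nc < N) else (nr, nc)

    P = np.zeros((n_states, 4, n_states))
    for r in range(N):
        for c in range(N):
            s = to_state(r, c)
            for a in range(4):
                P[s, a, to_state(*move(r, c, a))] += 0.7          # intended move
                for slip in SLIPS[a]:
                    P[s, a, to_state(*move(r, c, slip))] += 0.1   # lateral drift
                P[s, a, s] += 0.1                                 # no movement

    # Landing-cell bonus: +10 for the goal, -10 for obstacle cells.
    bonus = np.zeros(n_states)
    bonus[to_state(*goal)] = goal_reward
    for (r, c) in obstacles:
        bonus[to_state(r, c)] = obstacle_reward

    # Expected immediate reward: step cost plus expected landing-cell bonus.
    R = step_cost + np.einsum('sap,p->sa', P, bonus)
    return P, R
```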

Results

Plotting the initial environment to verify that it conforms to the picture in the problem statement:


Plot of the value function after 1 iteration:


Final plot of the value function:


Plot of the policy after 1 iteration:


Plot of the policy after 2 iterations:


Plot of the policy after 3 iterations:


Plot of the policy after 4 iterations:


Through the policy iteration algorithm, we successfully found the optimal policy for the 2D robot navigation problem. This involved calculating the value function for each state under a given policy, and then improving the policy by selecting, for each state, the action with the highest one-step look-ahead value. By repeating this process until convergence, we determined the optimal actions for the robot to take in order to reach the terminal state while minimizing cost and avoiding obstacles.
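
For completeness, here is a hypothetical usage sketch tying the pieces above together; the obstacle layout and start/goal cells below are placeholders, not the actual grid from the problem:

```python
# Hypothetical usage, building on the sketches above (names and layout are illustrative).
obstacles = [(3, 4), (3, 5), (4, 4), (4, 5)]       # example obstacle cells, not the actual map
start, goal = (9, 0), (0, 9)                       # example start (blue) and goal (green) cells

P, R = build_model(obstacles, goal)
policy, V = policy_iteration(P, R, gamma=0.9)

# Follow the most likely outcome of each control to read off a nominal trajectory.
r, c = start
trajectory = [(r, c)]
for _ in range(50):                                # cap the rollout length
    if (r, c) == goal:
        break
    dr, dc = ACTIONS[policy[r * N + c]]
    nr, nc = r + dr, c + dc
    if 0 <= nr < N and 0 <= nc < N:                # off-grid moves keep the robot in place
        r, c = nr, nc
    trajectory.append((r, c))
print(trajectory)
```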