/POMCP

Implementation of Partially Observable Monte Carlo Planning Algorithm

Primary LanguagePythonMIT LicenseMIT

README

Description

This project contains code for a general purpose Partially Observable Monte Carlo Tree Search algorithm taken from [1]. This is an online POMDP solver, ie it approximates the optimal policy locally at the current belief.

Documentation

auxilliary.py

Contains some helpful functions and a class to help construct and manipulate the tree of agent beliefs.

The way we implemented the tree is through the BuildTree() class. Each node in the tree is associated with a list of five elements. These are:

  1. The key of its parent node
  2. A dictionary whose keys are either action or observation indices and its values are the key of some child node. This holds all the child nodes at each node. For example, say taking the index 1 action from the current node leads us to the key 253 node. Then there will be an entry {1:253} in this dictionary.
  3. The number of times a simulation visited this node.
  4. The current estimate of the nodes value (kept as a running average of simulation values)
  5. The list of particles associated with the node; A list of all states that were sampled at this node at some simulation.

The association is implemented with a Python dictionary. Each node has a unique key which can be used to access its values. The root node always has -1 as key.

UCB()

Calculates UCB score.

Input:

  1. N (int): Total number of simulations (from root).
  2. n (int): Total number of simulations through current node.
  3. V (float): Running average.
  4. c (float): Controls importance of exploration in score.

Outputs:

  1. (float): UCB score for node.
powerset()

Returns the powerset of a given list. (Taken from itertools recipes)

Input:

  1. iterable (list):

Outputs:

  1. (iterable): powerset of input
BuildTree()
__init__()
  1. giveParameters (list): A list of necessary information for root node. Default is ['isRoot',{},0,0,[]]

Sets the following attributes:

  1. self.nodes (dictionary): Stores the tree.
ExpandTreeFrom()
Expands the tree from a given *parent* node. It adds a new node and informs its parent.

Input:

  1. parent (int): index of node from which the tree is expanded.
  2. index (int): index of action or observation.
  3. IsAction (Boolean): If expansion of tree is due to action or observation. Default False.

Outputs:

  1. None; updates the tree
isLeafNode()

Checks if a given key corresponds to a leaf node.

Input:

  1. n (int): key of node in questions.

Outputs:

  1. (Boolean): If the given key corresponds to a leaf node.
getObservationNode()

Taken a generated observation o and the key of the current node (say ha), this will either return the key of the resulting node hao or it will create this node and return its key (if it does not exist).

Input:

  1. h (int): key of current node.
  2. sample_observation (int): Index of sampled observation.

Outputs:

  1. (int): key for resulting node in current tree.
prune()

Removes input node from tree and calls itself on all of its children.

Input:

  1. node (int): key of node from which pruning starts.

Outputs:

  1. None; updates tree.
make_new_root()

Sets an input node as the new root.

Input:

  1. node (int): key of node from which pruning starts.

Outputs:

  1. None; updates tree.
prune_after_action()

After an input action is taken and an input observation occurs, sets the new root and prunes redundant branches of the old tree.

Input:

  1. action (int): index of action taken.
  2. observation (int): index of observation that occured.

Outputs:

  1. None; updates tree.
POMCP.py

Contains the POMCP class.

POMCP()
__init__()
  1. Generator (function): Specifies a function to be used as a blackbox generator for the underlying POMDP dynamics. This will be called during simulations and should take as arguments the indices of a state and an action in the underlying state and action spaces.
  2. gamma (float): The discounting factor for cost calculation. Should be <1. Default value is 0.95.
  3. e (float): Threshold value below which the expected sum of discounted rewards for the POMDP is considered 0. Default value is 0.005.
  4. c (float): Parameter that controls the importance of exploration in the UCB heuristic. Default value is 1.
  5. timeout (int): Controls the number of simulations that will be run from the current root at each call of Search(). Default value is 10000.
  6. no_particles (int): Controls the maximum number of particles that will be kept at each node and the number of particles that will be sampled from the posterior belief after an action is taken. Default value is 1200.
  7. Parallel (Boolean): Controls if posterior belief is sampled with multiple threads. Default value is False. Tested only on Ubuntu.
initialize()

Input:

  1. S (list): State space indices
  2. A (list): Action space indices
  3. O (list): Observation space indices

Initializes some internal variables.

SearchBest()

Finds the best action to take at a node according to some given metric (UCB or Value).

Input:

  1. h (int): Current node.
  2. UseUCB (Boolean): Whether to use UCB as metric or simply node value. Default True.

Outputs:

  1. (int): Action index with highest score (UCB or value)
  2. (int): key for resulting node in current tree.
Search()

Runs timeout number of simulations from current root and outputs the index of the best action to take according to estimated values.

Outputs:

  1. (int): Action index with highest value.
getObservationNode()

Taken a generated observation o and the key of the current node (say ha), this will either return the key of the resulting node hao or it will create this node and return its key (if it does not exist).

Input:

  1. h (int): key of current node.
  2. sample_observation (int): Index of sampled observation.

Outputs:

  1. (int): key for resulting node in current tree.
Rollout()

Runs Rollout as specified in [1].

Input:

  1. h (int): key of current node.
  2. depth (int): Current depth of tree; Important for recursion. When called set depth = 0

Outputs:

  1. (float): Estimate of input node value.
Simulate()

Runs Simulate as specified in [1].

Input:

  1. s (int): Index of sample initial state.
  2. h (int): key of current node.
  3. depth (int): Current depth of tree; Important for recursion. When called set depth = 0

Outputs:

  1. (float): Estimate of root node value. (it affects values in the entire tree)
PosteriorSample()

Samples from the posterior belief. Start by sampling a state from prior belief, then sampling a subsequent state from the transition probability distribution and an observation from the emission probability distribution until the sampled observation matches the real observation.

Input:

  1. Bh (list): List of (prior) particles.
  2. action (int): index of action taken.
  3. observation (int): index of observation that occured.

Outputs:

  1. (int): index of sampled state (from posterior belief)
UpdateBelief()
Updates particles of new root node after an action is taken and an observation occurs.

Input:

  1. action (int): index of action taken.
  2. observation (int): index of observation that occured.

Outputs:

  1. None; updates Bh element of new root node (with no_particles samples from the posterior belief)

Bibliography

[1] Silver, David, and Joel Veness. "Monte-Carlo planning in large POMDPs." Advances in Neural Information Processing Systems 23 (NIPS) (2010).