\setbeamertemplate{footline}[frame number]
- The sensor model is a probability distribution over observations given the underlying state. \pause
- Unlike policies for traditional MDPs, which map states to actions, POMDP policies map belief states (probability distributions over the states, inferred from the history of observations) to actions. \pause
- Exact optimal solutions yield, for every possible belief state, the action that maximizes expected reward (equivalently, minimizes expected cost); a small sketch of a belief-to-action policy follows the figure below.
\pause \begin{center} \includegraphics[width=0.8\textwidth]{fig1.jpg} \end{center}
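For intuition, here is a minimal sketch in plain Julia (with made-up states, actions, and thresholds) of what a belief-to-action policy looks like for a two-state problem: the belief is just a probability vector over the hidden states, and the policy is an ordinary function of that vector.

# Hypothetical two-state problem: the belief is [P(state 1), P(state 2)].
belief = [0.5, 0.5]

# A made-up threshold policy: commit to an action only when the belief is
# confident enough, otherwise take an information-gathering action.
function threshold_policy(b)
    if b[1] > 0.9
        return "act-assuming-state-1"
    elseif b[2] > 0.9
        return "act-assuming-state-2"
    else
        return "gather-more-information"
    end
end

threshold_policy(belief)   # returns "gather-more-information" for the uniform belief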
- S is the set of states \pause
- A is the set of actions \pause
- T is the state transition probability $P(s'|s,a)$ \pause
- R is the reward function $R(s,a)$ \pause
- $\Omega$ is the set of observations \pause
- O is the set of conditional observation probabilities $P(o|s')$ or $P(o|s',a)$ \pause
- $\gamma$ is the discount factor. \pause
- As usual, we want an optimal policy ($\pi$) that maximizes expected future reward; the belief update defined by these components is written out below.
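- For reference, taking action $a$ from belief $b$ and then observing $o$ yields the updated belief via the standard Bayes filter, normalized so that $b'$ sums to one:
$$ b'(s') \propto P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s) $$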
- Confounded POMDPs (agent is not sure of the state/reward association) \pause
- Decentralized POMDPs (multiple collaborating agents with only local information) \pause
- On-policy methods (learn about the policy that is actually being executed) \pause
- Off-policy methods (learn about a policy other than the one generating the data); the update rules sketched below illustrate the distinction. \pause
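- As a generic reminder (not specific to POMDPs): an on-policy update such as SARSA bootstraps from the action the current policy actually takes next, while an off-policy update such as Q-learning bootstraps from the greedy action:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right] \qquad \text{(SARSA, on-policy)} $$
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \qquad \text{(Q-learning, off-policy)} $$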
using Pkg; Pkg.add(["POMDPs", "QuickPOMDPs", "POMDPTools", "QMDP"])
using POMDPs, QuickPOMDPs, POMDPTools, QMDP
m = QuickPOMDP(
    states = ["left", "right"],
    actions = ["left", "right", "listen"],
    observations = ["left", "right"],
    initialstate = Uniform(["left", "right"]),
    discount = 0.95,
    transition = function (s, a)
        if a == "listen"
            return Deterministic(s) # the tiger stays behind the same door
        else # a door is opened
            return Uniform(["left", "right"]) # reset the problem
        end
    end,
    observation = function (s, a, sp)
        if a == "listen"
            if sp == "left"
                return SparseCat(["left", "right"], [0.85, 0.15]) # sparse categorical distribution
            else
                return SparseCat(["right", "left"], [0.85, 0.15])
            end
        else
            return Uniform(["left", "right"])
        end
    end,
    reward = function (s, a)
        if a == "listen"
            return -1.0
        elseif s == a # the door with the tiger behind it was opened
            return -100.0
        else # the other door was opened and the tiger was avoided
            return 10.0
        end
    end
)
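# QMDP is a fast, approximate offline solver: it assumes the state becomes fully
# observable after the next step, so it tends to undervalue purely
# information-gathering actions such as "listen".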
solver = QMDPSolver()
policy = solve(solver, m)
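# Roll out the policy in simulation for up to 10 steps, printing the state,
# current belief, action, and observation at each step and accumulating reward.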
rsum = 0.0
for (s,b,a,o,r) in stepthrough(m, policy, "s,b,a,o,r", max_steps=10)
    println("s: $s, b: $([s=>pdf(b,s) for s in states(m)]), a: $a, o: $o")
    global rsum += r
end
println("Undiscounted reward was $rsum.")