\setbeamertemplate{footline}[frame number]
- The sensor model is a probability distribution over observations given the underlying state. \pause
- Unlike policies for traditional MDPs, which map states to actions, POMDP policies map belief states (probability distributions over the states, inferred from the history of observations) to actions. \pause
- Exact optimal solutions yield, for every possible belief state, the action that maximizes expected reward (equivalently, minimizes expected cost); a small sketch of a belief-to-action policy follows the figure below.
\pause \begin{center} \includegraphics[width=0.8\textwidth]{fig1.jpg} \end{center}
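For intuition, here is a minimal sketch in plain Julia (with made-up states, actions, and thresholds) of what a belief-to-action policy looks like for a two-state problem: the belief is just a probability vector over the hidden states, and the policy is an ordinary function of that vector.

# Hypothetical two-state problem: the belief is [P(state 1), P(state 2)].
belief = [0.5, 0.5]

# A made-up threshold policy: commit to an action only when the belief is
# confident enough, otherwise take an information-gathering action.
function threshold_policy(b)
    if b[1] > 0.9
        return "act-assuming-state-1"
    elseif b[2] > 0.9
        return "act-assuming-state-2"
    else
        return "gather-more-information"
    end
end

threshold_policy(belief)   # returns "gather-more-information" for the uniform belief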
- S is the set of states \pause
- A is the set of actions \pause
- T is the state transition probability $P(s'|s,a)$ \pause
- R is the reward function $R(s,a)$ \pause
- $\Omega$ is the set of observations \pause
- O is the set of conditional observation probabilities $P(o|s')$ or $P(o|s',a)$ \pause
- $\gamma$ is the discount factor. \pause
- As usual, we want an optimal policy ($\pi$) that maximizes expected future reward; the belief update defined by these components is written out below.
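- For reference, taking action $a$ from belief $b$ and then observing $o$ yields the updated belief via the standard Bayes filter, normalized so that $b'$ sums to one:
$$ b'(s') \propto P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s) $$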
- Confounded POMDPs (agent is not sure of the state/reward association) \pause
- Decentralized POMDPs (multiple collaborating agents with only local information) \pause
- On-policy methods (learn about the policy that is actually being executed) \pause
- Off-policy methods (learn about a policy other than the one generating the data); the update rules sketched below illustrate the distinction. \pause
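- As a generic reminder (not specific to POMDPs): an on-policy update such as SARSA bootstraps from the action the current policy actually takes next, while an off-policy update such as Q-learning bootstraps from the greedy action:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right] \qquad \text{(SARSA, on-policy)} $$
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \qquad \text{(Q-learning, off-policy)} $$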
using Pkg; Pkg.add(["POMDPs", "QuickPOMDPs", "POMDPTools", "QMDP"])
using POMDPs, QuickPOMDPs, POMDPTools, QMDP
m = QuickPOMDP(
    states = ["left", "right"],
    actions = ["left", "right", "listen"],
    observations = ["left", "right"],
    initialstate = Uniform(["left", "right"]),
    discount = 0.95,
    transition = function (s, a)
        if a == "listen"
            return Deterministic(s) # the tiger stays behind the same door
        else # a door is opened
            return Uniform(["left", "right"]) # reset the problem
        end
    end,
    observation = function (s, a, sp)
        if a == "listen"
            if sp == "left"
                return SparseCat(["left", "right"], [0.85, 0.15]) # sparse categorical distribution
            else
                return SparseCat(["right", "left"], [0.85, 0.15])
            end
        else
            return Uniform(["left", "right"])
        end
    end,
    reward = function (s, a)
        if a == "listen"
            return -1.0
        elseif s == a # the door with the tiger behind it was opened
            return -100.0
        else # the other door was opened and the tiger was avoided
            return 10.0
        end
    end
)
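# QMDP is a fast, approximate offline solver: it assumes the state becomes fully
# observable after the next step, so it tends to undervalue purely
# information-gathering actions such as "listen".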
solver = QMDPSolver()
policy = solve(solver, m)
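# Roll out the policy in simulation for up to 10 steps, printing the state,
# current belief, action, and observation at each step and accumulating reward.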
rsum = 0.0
for (s,b,a,o,r) in stepthrough(m, policy, "s,b,a,o,r", max_steps=10)
    println("s: $s, b: $([s=>pdf(b,s) for s in states(m)]), a: $a, o: $o")
    global rsum += r
end
println("Undiscounted reward was $rsum.")