This is an attempt to build some better intuition about MDPs: insight into their properties (structure, geometry, dynamics, ...).
This work is inspired by:
- The Value Function Polytope in Reinforcement Learning
- Towards Characterizing Divergence in Deep Q-Learning
- Implicit Acceleration by Overparameterization
- Efficient computation of optimal actions
Setting
- All experiments are done with tabular MDPs.
- The rewards and transitions are stochastic.
- We use synchronous observations and updating.
- It is sometimes assumed we have access to the transition function and reward function.
```
pip install git+https://github.com/act65/mdps.git
```

Or, if you want to develop some new experiments:

```
git clone https://github.com/act65/mdps.git
cd mdps
pip install -r requirements.txt
python setup.py develop
```
I used a fixed random seed in my experiments, so you should be able to run each of the scripts in experiments/ and reproduce all the figures in figs/.
- `density_experiments.py`: How are policies distributed in value space? (A minimal sketch follows below.)
  - Visualise the density of the value function polytope.
  - Calculate the expected suboptimality (over all policies, and all possible Ps/rs). How does this change in high dimensions?
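As a rough sketch of the first item (the random 2-state, 2-action MDP and every name below are assumptions for illustration, not the repo's API): sample many random policies, evaluate each in closed form, and scatter their values. The scatter fills the value polytope and its density shows how policies are distributed in value space.

```python
import numpy as np
import matplotlib.pyplot as plt

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)

# A random tabular MDP: P[s, a, s'] are transition probabilities, r[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def value(pi):
    """V^pi = (I - gamma P^pi)^{-1} r^pi for a stochastic policy pi[s, a]."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sample random policies and scatter their values in value space.
policies = rng.dirichlet(np.ones(n_actions), size=(5000, n_states))
values = np.array([value(pi) for pi in policies])
plt.scatter(values[:, 0], values[:, 1], s=1, alpha=0.3)
plt.xlabel('V(s=0)'); plt.ylabel('V(s=1)')
plt.show()
```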
- `discounting_experiments.py`:
  - Visualise how changing the discount rate changes the shape of the polytope (see the sketch below).
  - How does the discount change the optimal policy?
  - Explore and visualise hyperbolic discounting.
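A minimal sketch of the first item (again assuming a random 2-state, 2-action MDP; illustrative only): the polytope's vertices are the values of the deterministic policies, so recomputing them for several discount rates shows how gamma reshapes the polytope.

```python
import itertools
import numpy as np

n_states, n_actions = 2, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def value(pi, gamma):
    """Closed-form policy evaluation for a stochastic policy pi[s, a]."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# All deterministic policies, as one-hot action choices per state.
det_policies = [np.eye(n_actions)[list(actions)]
                for actions in itertools.product(range(n_actions), repeat=n_states)]

# The polytope's vertices for each discount rate.
for gamma in [0.5, 0.9, 0.99]:
    vertices = np.array([value(pi, gamma) for pi in det_policies])
    print(gamma, np.round(vertices, 2))
```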
- `iteration_complexity_experiments.py`: How do different optimisers partition the value / policy space?
  - Visualise how the number of steps required by GPI partitions the policy / value spaces (a minimal sketch follows below).
  - Visualise a colour map of iteration complexity for PG / VI and their variants.
  - Calculate the tangent fields. Are they similar? What about parameterised versions: how can we calculate their vector fields?
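A rough sketch of the "partition by number of steps" idea (assuming a random tabular MDP and plain value iteration rather than general GPI; all names are illustrative): count how many Bellman-optimality updates are needed from each point on a grid of initial value estimates, then colour the grid by that count.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def bellman_optimality(V):
    """One step of value iteration: V <- max_a [r + gamma * P V]."""
    return (r + gamma * np.einsum('sap,p->sa', P, V)).max(axis=1)

# Approximate V* by iterating to (near) convergence.
V_star = np.zeros(n_states)
for _ in range(10_000):
    V_star = bellman_optimality(V_star)

def n_steps(V, tol=1e-3, max_iter=1000):
    """Number of value-iteration steps until V is within tol of V*."""
    for i in range(max_iter):
        if np.max(np.abs(V - V_star)) < tol:
            return i
        V = bellman_optimality(V)
    return max_iter

# Colour a grid of initial value estimates by iteration count to see the partition.
grid = np.linspace(-5.0, 5.0, 50)
counts = np.array([[n_steps(np.array([v0, v1])) for v0 in grid] for v1 in grid])
```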
- `trajectory_experiments.py`: What do the trajectories of momentum and overparameterised optimisations look like on a polytope? (A minimal sketch follows below.)
  - Visualise how momentum changes the trajectories of different solvers.
  - How does overparameterisation yield acceleration? And how do its trajectories relate to optimisation via momentum?
  - Test dynamics with complex-valued parameters.
  - Generalise the types of parameterised functions (jax must have some tools for this).
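A minimal sketch of what a momentum trajectory on the polytope could look like (written with jax, since the repo mentions it; the softmax parameterisation, heavy-ball update and all names here are assumptions): gradient ascent on the value of a tabular softmax policy, recording the value after every update so the path can be drawn on the polytope.

```python
import jax
import jax.numpy as jnp

n_states, n_actions, gamma = 2, 2, 0.9
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
P = jax.random.dirichlet(k1, jnp.ones(n_states), shape=(n_states, n_actions))
r = jax.random.normal(k2, (n_states, n_actions))

def value(logits):
    """Value of the softmax policy defined by logits[s, a]."""
    pi = jax.nn.softmax(logits, axis=-1)
    P_pi = jnp.einsum('sa,sap->sp', pi, P)
    r_pi = jnp.einsum('sa,sa->s', pi, r)
    return jnp.linalg.solve(jnp.eye(n_states) - gamma * P_pi, r_pi)

grad_fn = jax.grad(lambda logits: value(logits).sum())  # ascend total value

logits = jax.random.normal(k3, (n_states, n_actions))
velocity = jnp.zeros_like(logits)
lr, beta = 0.1, 0.9
trajectory = []
for _ in range(200):
    velocity = beta * velocity + grad_fn(logits)   # heavy-ball momentum
    logits = logits + lr * velocity
    trajectory.append(value(logits))               # points to draw on the polytope
```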
- Other
  - Generalise GPI to work in higher dimensions. Calculate how it scales.
- `generalisation.py`: How does generalisation accelerate the dynamics / learning?
  - Use the NTK to explore trajectories when generalisation does / doesn't make sense (a rough sketch follows below).
  - ???
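A small sketch of how the NTK could be probed here (jax again; the tiny MLP value network and every name below are assumptions for illustration): the empirical kernel K[i, j] = <dV(s_i)/dtheta, dV(s_j)/dtheta> measures how much an update at one state moves the value at another, which is one way to ask when generalisation "makes sense".

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_mlp(key, sizes=(1, 32, 1)):
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in),
                       jnp.zeros(n_out)))
    return params

def mlp(params, x):
    """A tiny value network: states in, scalar values out."""
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return (x @ W + b).squeeze(-1)

params = init_mlp(jax.random.PRNGKey(0))
states = jnp.linspace(-1.0, 1.0, 10)[:, None]

# Empirical NTK: K = J J^T, where J[i, :] = dV(s_i)/dtheta (flattened parameters).
flat, unravel = ravel_pytree(params)
J = jax.jacobian(lambda w: mlp(unravel(w), states))(flat)
K = J @ J.T
```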
- `lmdp_experiments.py`: How do LMDPs compare to MDPs? (A minimal solver sketch follows below.)
  - Do they give similar results?
  - What does the linearised TD operator look like (its vector field)?
  - How do they scale?
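A minimal sketch of solving an LMDP in the style of "Efficient computation of optimal actions" (assuming the infinite-horizon average-cost formulation; variable names are illustrative): power iteration on the desirability z = exp(-v), then the optimal dynamics are the passive dynamics reweighted by z.

```python
import numpy as np

n_states = 4
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=n_states)  # passive dynamics p(x'|x)
q = rng.uniform(0.0, 1.0, size=n_states)             # state costs q(x)

# Power iteration on z = exp(-q) * (P z): z is the principal eigenvector of QP.
z = np.ones(n_states)
for _ in range(1000):
    z = np.exp(-q) * (P @ z)
    z = z / np.linalg.norm(z)

v = -np.log(z)                        # value function, up to an additive constant
u = P * z[None, :]                    # optimal dynamics u*(x'|x) ∝ p(x'|x) z(x')
u = u / u.sum(axis=1, keepdims=True)
```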
- `graph_signal_vis.py`: Generate a graph of update transitions under an update fn. The nodes are the values of the deterministic policies; this could be a way to visualise in higher dimensions!? Represent the polytope as a graph, with the value as a signal on the graph. Need a way to take V_pi -> \sum_i a_i . V_pi_det_i. Two nodes are connected if their policies differ by only a single action. (A rough sketch follows below.)
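A rough sketch of the graph described above (assuming a small tabular MDP and networkx; all names are illustrative): nodes are deterministic policies carrying their value as a signal, and two nodes are connected if the policies differ at exactly one state.

```python
import itertools
import numpy as np
import networkx as nx

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def value(actions):
    """Value of the deterministic policy that takes actions[s] in state s."""
    pi = np.eye(n_actions)[list(actions)]
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

policies = list(itertools.product(range(n_actions), repeat=n_states))
G = nx.Graph()
for pi in policies:
    G.add_node(pi, value=value(pi))             # the value is a signal on the graph
for a, b in itertools.combinations(policies, 2):
    if sum(x != y for x, y in zip(a, b)) == 1:  # differ at exactly one state
        G.add_edge(a, b)
```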
- `mc-grap.py`: Could explore different ways of estimating the gradient: score, pathwise, measure-valued (a minimal comparison follows below).
  - Visualise a distributional value polytope.
  - The effects of exploration, sampling and buffers.
  - How they affect the stability of the dynamics.
  - How they bias the learning.
  - ?
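A minimal comparison of two of those estimators on a toy problem (the Gaussian objective below is an assumption for illustration, not anything in the repo): estimate d/dmu E_{x~N(mu,1)}[x^2], whose true value is 2*mu, with the score-function and pathwise estimators.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n = 1.5, 100_000
f = lambda x: x ** 2

# Score-function (REINFORCE): grad = E[f(x) * d/dmu log N(x; mu, 1)] = E[f(x) * (x - mu)].
x = rng.normal(mu, 1.0, size=n)
score_estimate = np.mean(f(x) * (x - mu))

# Pathwise (reparameterisation): x = mu + eps, grad = E[f'(mu + eps)] = E[2 * (mu + eps)].
eps = rng.normal(0.0, 1.0, size=n)
pathwise_estimate = np.mean(2.0 * (mu + eps))

print(score_estimate, pathwise_estimate, 2 * mu)  # both ≈ 3.0; pathwise has lower variance
```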
Observations
- Parameterisation seems to accelerate PG, but not VI. Why?
- With smaller and smaller learning rates, and fixed decay, the dynamics of momentum approach those of GD.
Questions
- Are param + momentum dynamics more unstable? Or is it that you move around value space in non-linear ways?
- Is param + momentum only faster because it is allowed larger changes? (Normalise for the number of updates being made.) Answer: no, PPG is still faster.
- What is the max difference between a trajectory derived from a continuous flow versus a trajectory of discretised steps on the same gradient field?
- What happens if we take two reparameterisations of the same matrix? Are their dynamics different? Answer: No. (???)
- What are the ideal dynamics? PI jumps around. VI travels in straight lines, kinda.