This is an attempt to build some better intuition about MDPs: insight into their properties (structure, geometry, dynamics, ...).
This work is inspired by:
- The Value Function Polytope in Reinforcement Learning
- Towards Characterizing Divergence in Deep Q-Learning
- Implicit Acceleration by Overparameterization
- Efficient computation of optimal actions
Setting
- All experiments are done with tabular MDPs.
- The rewards and transitions are stochastic.
- We use synchronous observations and updating.
- It is sometimes assumed we have access to the transition function and reward function.
```
pip install git+https://github.com/act65/mdps.git
```

Or, if you want to develop some new experiments:

```
git clone https://github.com/act65/mdps.git
cd mdps
pip install -r requirements.txt
python setup.py develop
```
I used a fixed random seed in my experiments, so you should be able to run each of the scripts in experiments/ and reproduce all the figures in figs/.
- `density_experiments.py`: How are policies distributed in value space? (A minimal sketch follows below.)
  - Visualise the density of the value function polytope.
  - Calculate the expected suboptimality (over all policies, and all possible Ps/rs). How does this change in high dimensions?
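As a rough sketch of the first item (the random 2-state, 2-action MDP and every name below are assumptions for illustration, not the repo's API): sample many random policies, evaluate each in closed form, and scatter their values. The scatter fills the value polytope and its density shows how policies are distributed in value space.

```python
import numpy as np
import matplotlib.pyplot as plt

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)

# A random tabular MDP: P[s, a, s'] are transition probabilities, r[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def value(pi):
    """V^pi = (I - gamma P^pi)^{-1} r^pi for a stochastic policy pi[s, a]."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sample random policies and scatter their values in value space.
policies = rng.dirichlet(np.ones(n_actions), size=(5000, n_states))
values = np.array([value(pi) for pi in policies])
plt.scatter(values[:, 0], values[:, 1], s=1, alpha=0.3)
plt.xlabel('V(s=0)'); plt.ylabel('V(s=1)')
plt.show()
```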
- `discounting_experiments.py`:
  - Visualise how changing the discount rate changes the shape of the polytope (see the sketch below).
  - How does the discount change the optimal policy?
  - Explore and visualise hyperbolic discounting.
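A minimal sketch of the first item (again assuming a random 2-state, 2-action MDP; illustrative only): the polytope's vertices are the values of the deterministic policies, so recomputing them for several discount rates shows how gamma reshapes the polytope.

```python
import itertools
import numpy as np

n_states, n_actions = 2, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def value(pi, gamma):
    """Closed-form policy evaluation for a stochastic policy pi[s, a]."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# All deterministic policies, as one-hot action choices per state.
det_policies = [np.eye(n_actions)[list(actions)]
                for actions in itertools.product(range(n_actions), repeat=n_states)]

# The polytope's vertices for each discount rate.
for gamma in [0.5, 0.9, 0.99]:
    vertices = np.array([value(pi, gamma) for pi in det_policies])
    print(gamma, np.round(vertices, 2))
```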
- `iteration_complexity_experiments.py`: How do different optimisers partition the value / policy space?
  - Visualise how the number of steps required by GPI partitions the policy / value spaces (a minimal sketch follows below).
  - Visualise a colour map of iteration complexity for PG / VI and their variants.
  - Calculate the tangent fields. Are they similar? What about parameterised versions: how can we calculate their vector fields?
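A rough sketch of the "partition by number of steps" idea (assuming a random tabular MDP and plain value iteration rather than general GPI; all names are illustrative): count how many Bellman-optimality updates are needed from each point on a grid of initial value estimates, then colour the grid by that count.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def bellman_optimality(V):
    """One step of value iteration: V <- max_a [r + gamma * P V]."""
    return (r + gamma * np.einsum('sap,p->sa', P, V)).max(axis=1)

# Approximate V* by iterating to (near) convergence.
V_star = np.zeros(n_states)
for _ in range(10_000):
    V_star = bellman_optimality(V_star)

def n_steps(V, tol=1e-3, max_iter=1000):
    """Number of value-iteration steps until V is within tol of V*."""
    for i in range(max_iter):
        if np.max(np.abs(V - V_star)) < tol:
            return i
        V = bellman_optimality(V)
    return max_iter

# Colour a grid of initial value estimates by iteration count to see the partition.
grid = np.linspace(-5.0, 5.0, 50)
counts = np.array([[n_steps(np.array([v0, v1])) for v0 in grid] for v1 in grid])
```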
- `trajectory_experiments.py`: What do the trajectories of momentum and overparameterised optimisations look like on a polytope? (A minimal sketch follows below.)
  - Visualise how momentum changes the trajectories of different solvers.
  - How does overparameterisation yield acceleration? And how do its trajectories relate to optimisation via momentum?
  - Test dynamics with complex-valued parameters.
  - Generalise the types of parameterised functions (jax must have some tools for this).
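A minimal sketch of what a momentum trajectory on the polytope could look like (written with jax, since the repo mentions it; the softmax parameterisation, heavy-ball update and all names here are assumptions): gradient ascent on the value of a tabular softmax policy, recording the value after every update so the path can be drawn on the polytope.

```python
import jax
import jax.numpy as jnp

n_states, n_actions, gamma = 2, 2, 0.9
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
P = jax.random.dirichlet(k1, jnp.ones(n_states), shape=(n_states, n_actions))
r = jax.random.normal(k2, (n_states, n_actions))

def value(logits):
    """Value of the softmax policy defined by logits[s, a]."""
    pi = jax.nn.softmax(logits, axis=-1)
    P_pi = jnp.einsum('sa,sap->sp', pi, P)
    r_pi = jnp.einsum('sa,sa->s', pi, r)
    return jnp.linalg.solve(jnp.eye(n_states) - gamma * P_pi, r_pi)

grad_fn = jax.grad(lambda logits: value(logits).sum())  # ascend total value

logits = jax.random.normal(k3, (n_states, n_actions))
velocity = jnp.zeros_like(logits)
lr, beta = 0.1, 0.9
trajectory = []
for _ in range(200):
    velocity = beta * velocity + grad_fn(logits)   # heavy-ball momentum
    logits = logits + lr * velocity
    trajectory.append(value(logits))               # points to draw on the polytope
```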
- Other
  - Generalise GPI to work in higher dimensions. Calculate how it scales.
- `generalisation.py`: How does generalisation accelerate the dynamics / learning?
  - Use the NTK to explore trajectories when generalisation does / doesn't make sense (a rough sketch follows below).
  - ???
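A small sketch of how the NTK could be probed here (jax again; the tiny MLP value network and every name below are assumptions for illustration): the empirical kernel K[i, j] = <dV(s_i)/dtheta, dV(s_j)/dtheta> measures how much an update at one state moves the value at another, which is one way to ask when generalisation "makes sense".

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_mlp(key, sizes=(1, 32, 1)):
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in),
                       jnp.zeros(n_out)))
    return params

def mlp(params, x):
    """A tiny value network: states in, scalar values out."""
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return (x @ W + b).squeeze(-1)

params = init_mlp(jax.random.PRNGKey(0))
states = jnp.linspace(-1.0, 1.0, 10)[:, None]

# Empirical NTK: K = J J^T, where J[i, :] = dV(s_i)/dtheta (flattened parameters).
flat, unravel = ravel_pytree(params)
J = jax.jacobian(lambda w: mlp(unravel(w), states))(flat)
K = J @ J.T
```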
- `lmdp_experiments.py`: How do LMDPs compare to MDPs? (A minimal solver sketch follows below.)
  - Do they give similar results?
  - What does the linearised TD operator look like (its vector field)?
  - How do they scale?
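A minimal sketch of solving an LMDP in the style of "Efficient computation of optimal actions" (assuming the infinite-horizon average-cost formulation; variable names are illustrative): power iteration on the desirability z = exp(-v), then the optimal dynamics are the passive dynamics reweighted by z.

```python
import numpy as np

n_states = 4
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=n_states)  # passive dynamics p(x'|x)
q = rng.uniform(0.0, 1.0, size=n_states)             # state costs q(x)

# Power iteration on z = exp(-q) * (P z): z is the principal eigenvector of QP.
z = np.ones(n_states)
for _ in range(1000):
    z = np.exp(-q) * (P @ z)
    z = z / np.linalg.norm(z)

v = -np.log(z)                        # value function, up to an additive constant
u = P * z[None, :]                    # optimal dynamics u*(x'|x) ∝ p(x'|x) z(x')
u = u / u.sum(axis=1, keepdims=True)
```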
- `graph_signal_vis.py`: Generate a graph of update transitions under an update fn. The nodes are the values of the deterministic policies; this could be a way to visualise in higher dimensions!? Represent the polytope as a graph, with the value as a signal on the graph. Need a way to take V_pi -> \sum_i a_i . V_pi_det_i. Two nodes are connected if their policies differ by only a single action. (A rough sketch follows below.)
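A rough sketch of the graph described above (assuming a small tabular MDP and networkx; all names are illustrative): nodes are deterministic policies carrying their value as a signal, and two nodes are connected if the policies differ at exactly one state.

```python
import itertools
import numpy as np
import networkx as nx

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def value(actions):
    """Value of the deterministic policy that takes actions[s] in state s."""
    pi = np.eye(n_actions)[list(actions)]
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

policies = list(itertools.product(range(n_actions), repeat=n_states))
G = nx.Graph()
for pi in policies:
    G.add_node(pi, value=value(pi))             # the value is a signal on the graph
for a, b in itertools.combinations(policies, 2):
    if sum(x != y for x, y in zip(a, b)) == 1:  # differ at exactly one state
        G.add_edge(a, b)
```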
- `mc-grap.py`: Could explore different ways of estimating the gradient: score, pathwise, measure-valued (a minimal comparison follows below).
  - Visualise a distributional value polytope.
  - The effects of exploration, sampling and buffers.
  - How they affect the stability of the dynamics.
  - How they bias the learning.
  - ?
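A minimal comparison of two of those estimators on a toy problem (the Gaussian objective below is an assumption for illustration, not anything in the repo): estimate d/dmu E_{x~N(mu,1)}[x^2], whose true value is 2*mu, with the score-function and pathwise estimators.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n = 1.5, 100_000
f = lambda x: x ** 2

# Score-function (REINFORCE): grad = E[f(x) * d/dmu log N(x; mu, 1)] = E[f(x) * (x - mu)].
x = rng.normal(mu, 1.0, size=n)
score_estimate = np.mean(f(x) * (x - mu))

# Pathwise (reparameterisation): x = mu + eps, grad = E[f'(mu + eps)] = E[2 * (mu + eps)].
eps = rng.normal(0.0, 1.0, size=n)
pathwise_estimate = np.mean(2.0 * (mu + eps))

print(score_estimate, pathwise_estimate, 2 * mu)  # both ≈ 3.0; pathwise has lower variance
```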
Observations
- Parameterisation seems to accelerate PG, but not VI. Why?
- With smaller and smaller learning rates, and fixed decay, the dynamics of momentum approach those of GD.
Questions
- Are param + momentum dynamics more unstable? Or is it that you move around value space in non-linear ways?
- Is param + momentum only faster because it is allowed larger changes? (Normalise for the number of updates being made.) Answer: no, PPG is still faster.
- What is the max difference between a trajectory derived from a continuous flow versus a trajectory of discretised steps on the same gradient field?
- What happens if we take two reparameterisations of the same matrix? Are their dynamics different? Answer: No. (???)
- What are the ideal dynamics? PI jumps around. VI travels in straight lines, kinda.