Started with $\sigma = 0.1$, which did not improve the policy much over time. Increasing $\sigma$ to 0.9 widened the distribution from which $\theta$ is sampled, and the search then found parameters achieving near-optimal return.
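A minimal sketch of this Gaussian-perturbation search, assuming an `evaluate(theta)` helper (not shown here) that returns the estimated episodic return of the policy parameterized by `theta`; the function and parameter names are illustrative, not the exact code used.

```python
import numpy as np

def hill_climb(evaluate, dim, sigma=0.9, iterations=200, seed=0):
    """Gaussian-perturbation search: sample a candidate theta around the
    current best and keep it if it improves the estimated return."""
    rng = np.random.default_rng(seed)
    best_theta = np.zeros(dim)
    best_return = evaluate(best_theta)
    for _ in range(iterations):
        # sigma controls the width of the sampling distribution;
        # a larger sigma explores the parameter space more aggressively.
        candidate = best_theta + sigma * rng.standard_normal(dim)
        ret = evaluate(candidate)
        if ret > best_return:
            best_theta, best_return = candidate, ret
    return best_theta, best_return
```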
Value Iteration
Value iteration converged in about 50 iterations with an error threshold of $\epsilon = 10^{-6}$ on the change in the value function between sweeps.
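A sketch of the update, assuming a tabular MDP given as a transition tensor `P[s, a, s']` and reward matrix `R[s, a]` (array names are assumptions for illustration):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Repeat the Bellman optimality backup until the largest change
    in V across states falls below the error threshold eps."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)  # values and greedy policy
        V = V_new
```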
Cartpole
Parameterized Gradient Ascent
Tried three different $\sigma$ values (0.1, 0.3, 0.9). $\sigma = 0.9$ performs best, both in the return reached and in how quickly it converges.
Cross Entropy Method
With $K=10$, $K_{\epsilon}=3$, and $\epsilon=0.99$, CEM converged to the optimal return of 1000 in about 80 iterations. At each iteration, $\theta$ is updated to the average of the top $K_{\epsilon}$ candidates.
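A sketch of the cross-entropy loop under the same assumed `evaluate(theta)` helper as above; here $\epsilon$ is treated as a regularizer added to the covariance to keep exploration alive, which is one common convention and an assumption about its role, not necessarily the exact formulation used.

```python
import numpy as np

def cem(evaluate, dim, K=10, K_e=3, eps=0.99, iterations=80, seed=0):
    """Cross-entropy method: sample K candidate thetas from a Gaussian,
    keep the top K_e by estimated return, and refit the Gaussian to them."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    cov = np.eye(dim)
    for _ in range(iterations):
        candidates = rng.multivariate_normal(mean, cov, size=K)
        returns = np.array([evaluate(theta) for theta in candidates])
        elites = candidates[np.argsort(returns)[-K_e:]]  # top K_e candidates
        mean = elites.mean(axis=0)                       # new theta = elite average
        # eps * I keeps the covariance from collapsing too early (assumed role of eps)
        cov = np.cov(elites, rowvar=False) + eps * np.eye(dim)
    return mean
```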