Started with $\sigma = 0.1$, which did not improve the policy much over time. Increasing $\sigma$ to 0.9 widened the distribution from which $\theta$ is sampled, and the search then found parameters achieving near-optimal return.
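A minimal sketch of this Gaussian-perturbation search, assuming an `evaluate(theta)` helper (not shown here) that returns the estimated episodic return of the policy parameterized by `theta`; the function and parameter names are illustrative, not the exact code used.

```python
import numpy as np

def hill_climb(evaluate, dim, sigma=0.9, iterations=200, seed=0):
    """Gaussian-perturbation search: sample a candidate theta around the
    current best and keep it if it improves the estimated return."""
    rng = np.random.default_rng(seed)
    best_theta = np.zeros(dim)
    best_return = evaluate(best_theta)
    for _ in range(iterations):
        # sigma controls the width of the sampling distribution;
        # a larger sigma explores the parameter space more aggressively.
        candidate = best_theta + sigma * rng.standard_normal(dim)
        ret = evaluate(candidate)
        if ret > best_return:
            best_theta, best_return = candidate, ret
    return best_theta, best_return
```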
Value Iteration
Value iteration converged in about 50 iterations with an error threshold of $\epsilon = 10^{-6}$ on the change in the value function between sweeps.
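A sketch of the update, assuming a tabular MDP given as a transition tensor `P[s, a, s']` and reward matrix `R[s, a]` (array names are assumptions for illustration):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Repeat the Bellman optimality backup until the largest change
    in V across states falls below the error threshold eps."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)  # values and greedy policy
        V = V_new
```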
Cartpole
Parameterized Gradient Ascent
Tried three different $\sigma$ values (0.1, 0.3, 0.9). $\sigma = 0.9$ performs best, both in the return reached and in how quickly it converges.
Cross Entropy Method
With $K=10$, $K_{\epsilon}=3$, and $\epsilon=0.99$, CEM converged to the optimal return of 1000 in about 80 iterations. At each iteration, $\theta$ is updated to the average of the top $K_{\epsilon}$ candidates.
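A sketch of the cross-entropy loop under the same assumed `evaluate(theta)` helper as above; here $\epsilon$ is treated as a regularizer added to the covariance to keep exploration alive, which is one common convention and an assumption about its role, not necessarily the exact formulation used.

```python
import numpy as np

def cem(evaluate, dim, K=10, K_e=3, eps=0.99, iterations=80, seed=0):
    """Cross-entropy method: sample K candidate thetas from a Gaussian,
    keep the top K_e by estimated return, and refit the Gaussian to them."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    cov = np.eye(dim)
    for _ in range(iterations):
        candidates = rng.multivariate_normal(mean, cov, size=K)
        returns = np.array([evaluate(theta) for theta in candidates])
        elites = candidates[np.argsort(returns)[-K_e:]]  # top K_e candidates
        mean = elites.mean(axis=0)                       # new theta = elite average
        # eps * I keeps the covariance from collapsing too early (assumed role of eps)
        cov = np.cov(elites, rowvar=False) + eps * np.eye(dim)
    return mean
```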