A list of stupid mistakes made while implementing this algorithm.
- Always check numpy array shapes. Specifically, check that you haven't broadcast a (64,)-shaped array against a (64, 1)-shaped array! 🤦
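  A minimal sketch of how this slips through silently (the array names here are made up for illustration):

  ```python
  import numpy as np

  rewards = np.ones(64)          # shape (64,)
  q_values = np.ones((64, 1))    # shape (64, 1)

  # Broadcasting turns the element-wise subtraction you wanted into
  # an all-pairs (64, 64) matrix, and numpy raises no error.
  wrong = rewards - q_values
  print(wrong.shape)             # (64, 64)

  # Squeezing the trailing axis first gives the intended result.
  right = rewards - q_values.squeeze(-1)
  print(right.shape)             # (64,)
  ```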
- Check every variable. I spent ages trying to figure out why nothing was being learned, only to discover that instead of returning states and next_states from the memory buffer sample I was just returning states and states! 🤦
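  A minimal replay-buffer sketch (not the repo's actual class; names are illustrative) showing where the bug hid:

  ```python
  import random
  from collections import deque

  class ReplayBuffer:
      def __init__(self, capacity=100_000):
          self.buffer = deque(maxlen=capacity)

      def store(self, state, action, reward, next_state, done):
          self.buffer.append((state, action, reward, next_state, done))

      def sample(self, batch_size):
          batch = random.sample(self.buffer, batch_size)
          states, actions, rewards, next_states, dones = zip(*batch)
          # The bug was a one-word slip: returning `states` twice here,
          # so the TD target bootstrapped from the current state and
          # the agent learned nothing.
          return states, actions, rewards, next_states, dones
  ```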
- Copied and pasted the actor network while building the critic and accidentally forgot to remove the `tanh` activation, meaning the critic could at most predict a total reward of 1 or -1 for the entire episode given any state and action pair! 🤦
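  A sketch of the failure mode, written in PyTorch for illustration (the repo's framework and layer sizes may differ):

  ```python
  import torch
  import torch.nn as nn

  class Critic(nn.Module):
      """Q(s, a) network. The output layer must stay linear."""
      def __init__(self, state_dim, action_dim):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
              nn.Linear(256, 256), nn.ReLU(),
              nn.Linear(256, 1),
              # The bug: a leftover nn.Tanh() here squashes every
              # Q-value into [-1, 1], so any return outside that range
              # is unrepresentable and learning stalls.
          )

      def forward(self, state, action):
          return self.net(torch.cat([state, action], dim=-1))
  ```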
- Left the hard-coded high action bound from training on the pendulum environment as a default when initializing the actor model. I correctly adjusted it for the actor on the agent class but not for the target actor, meaning the target actor would always output 2 times the action the actor would! 🤦
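  A sketch of the fix (the `Actor` class, names, and the gymnasium calls are illustrative, not the repo's code): read the bound off the environment and pass it explicitly to *every* copy of the actor, never relying on a default:

  ```python
  import torch
  import torch.nn as nn
  import gymnasium as gym

  class Actor(nn.Module):
      def __init__(self, state_dim, action_dim, action_bound):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(state_dim, 256), nn.ReLU(),
              nn.Linear(256, action_dim), nn.Tanh(),
          )
          self.action_bound = action_bound  # scales the tanh output

      def forward(self, state):
          return self.net(state) * self.action_bound

  env = gym.make("Pendulum-v1")
  state_dim = env.observation_space.shape[0]
  action_dim = env.action_space.shape[0]
  action_high = float(env.action_space.high[0])  # 2.0 for Pendulum

  # Passing the bound to the actor but letting the target actor fall
  # back on a hard-coded default is exactly the bug described above.
  actor = Actor(state_dim, action_dim, action_bound=action_high)
  target_actor = Actor(state_dim, action_dim, action_bound=action_high)
  target_actor.load_state_dict(actor.state_dict())
  ```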