twni2016/Memory-RL

Movement Penalty in TMaze

jakegrigsby opened this issue · 3 comments

Hi, thanks for releasing this! I'm curious about the reward function used in the main TMaze results. Maybe I'm misunderstanding the way the active/passive versions disentangle credit assignment.

It seems like the goal of both environments is to unit-test whether a Transformer can learn to recall information from the first (passive) or third (active, after going backwards first) observation at the final timestep. The penalty is meant to remove the sparse exploration problem of navigating to the goal position. You implement that penalty as `rew = float(x < time_step - oracle_length) * penalty`, which is active whenever the policy falls behind the pace of the optimal policy. Since the policy has no way to make up that pace, it is penalized on every timestep after the first one where it disagrees with the optimal policy. Wouldn't we be testing the same recall ability on the final timestep if the penalty were instead `rew = float(action != right and time_step > oracle_length) * penalty`? Both provide a dense signal to move towards the goal, but the second version can provide that signal in any sub-optimal episode, while the original version basically seems like kryptonite for epsilon-greedy and forces an unusually low epsilon schedule.
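For reference, here is a minimal sketch of the two variants as I'm describing them (placeholder names, not the repo's exact code; `x`, `time_step`, `oracle_length`, the `RIGHT` action id, and the penalty value are all just illustrative):

```python
RIGHT = 0  # hypothetical id of the "move right" action

def original_penalty(x, time_step, oracle_length, penalty=-1.0):
    # Fires whenever the agent's x position has fallen behind the pace of the
    # optimal policy; once behind, every remaining timestep is penalized.
    return float(x < time_step - oracle_length) * penalty

def proposed_penalty(action, time_step, oracle_length, penalty=-1.0):
    # Fires only on timesteps where the agent does not move right after the
    # oracle phase, so even an episode that has already fallen behind still
    # gets a dense signal it can act on.
    return float(action != RIGHT and time_step > oracle_length) * penalty
```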

Hi, I think both penalties seem okay. My penalty function strictly follows the definition of the credit assignment (CA) length for the optimal policy, while yours (penalizing not moving right) follows the CA length for the current policy. Yours seems more intuitive and easier to solve, but it loses the connection with the optimal policy (recall that the CA length is defined on the support of the history visitation of the optimal policy).

Ah, so this is fine to improve sample efficiency on passive but breaks the CA evaluation of the active version?

I tried this on passive and it quickly converges to 0 (always go right). The issue is that we need exploration on the last timestep, but if there is any exploration before that, we don't make it to the goal. Dropping epsilon to 1/T makes it there with probability ~1/e, as you note in the appendix, but then we might need thousands of episodes before the 1/T chance discovers the up/down actions on the last timestep rather than going right, which is what we converged to in all other states.
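For concreteness, the rough calculation behind that 1/e figure (a sketch that assumes the agent must take the greedy "right" action on every one of the ~T corridor steps, and ignores the chance that a random action happens to be "right"):

```python
import math

T = 100        # corridor length / episode horizon (illustrative)
eps = 1 / T    # epsilon-greedy exploration rate
# Staying on the optimal path for all T steps happens with probability
# (1 - eps)^T, which tends to 1/e as T grows.
p_reach = (1 - eps) ** T
print(f"(1 - 1/T)^T = {p_reach:.3f}  vs  1/e = {math.exp(-1):.3f}")
```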

Seems like the fastest approach is a custom epsilon schedule that is near zero when t < horizon and much higher on the last timestep. Very hacky, but it still evaluates the memory question, and I'm seeing a ~100x sample efficiency boost...
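Roughly something like this (a sketch; the function names and epsilon values are just illustrative, and it assumes the junction is the final step of the horizon):

```python
import random

def epsilon_at(t, horizon, eps_corridor=0.01, eps_final=0.5):
    # Near-zero exploration while walking the corridor, much heavier
    # exploration at the final (junction) timestep.
    return eps_final if t == horizon - 1 else eps_corridor

def select_action(q_values, t, horizon):
    # Standard epsilon-greedy, just with the step-dependent epsilon above.
    if random.random() < epsilon_at(t, horizon):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```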

> Ah, so this is fine to improve sample efficiency on passive but breaks the CA evaluation of the active version?

Sorry, I think my previous answer is incorrect. Indeed, both my penalty function and yours satisfy the CA length of the optimal policy (the differences only appear in suboptimal trajectories, which the optimal policy does not cover). Yours will be easier to solve, and you have empirically verified it.

> Seems like the fastest approach is a custom epsilon schedule that is near zero when t < horizon and much higher on the last timestep. Very hacky, but it still evaluates the memory question, and I'm seeing a ~100x sample efficiency boost...

Yes, I fully agree that besides memory, exploration is also important for solving T-Mazes. The current vanilla eps-greedy works, but it is not efficient. I think the comparison between different sequence architectures is fair as long as you use the same exploration strategy and the same penalty function.