twni2016/Memory-RL

Questions about the implementation

hai-h-nguyen opened this issue · 9 comments

Hi,

What are the purposes of past_key_values and past_embeds in your GPT-2 agent? I didn't see past_embeds being fed into the transformer model, but it is saved within h_0. As for past_key_values, is it used so that we don't have to recompute them, given that the model weights do not change during inference?

https://github.com/twni2016/Memory-RL/blob/main/policies/seq_models/gpt2_vanilla.py

You're correct about past_key_values: during inference, when a sequence is generated auto-regressively, past_key_values do not need to be recomputed at every time step.

past_embeds are used internally in the model in a similar inference scenario, since previous inputs are still necessary to generate the next token.
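For reference, here is a minimal sketch (not the repo's exact code; the variable names are mine) of how past_key_values caching typically works with HuggingFace's `GPT2Model` when a sequence is processed one step at a time:

```python
# Minimal sketch of past_key_values caching with HuggingFace's GPT2Model.
# Illustrative only; gpt2_vanilla.py wraps this pattern in its own class.
import torch
from transformers import GPT2Config, GPT2Model

config = GPT2Config(n_embd=64, n_layer=2, n_head=2)
model = GPT2Model(config).eval()

past_key_values = None  # cache carried across time steps during a rollout
with torch.no_grad():
    for t in range(10):
        # one newly embedded input per step, shape (batch, 1, n_embd)
        new_embed = torch.randn(1, 1, config.n_embd)
        out = model(
            inputs_embeds=new_embed,
            past_key_values=past_key_values,  # reuse cached keys/values
            use_cache=True,
        )
        past_key_values = out.past_key_values  # grows by one step per call
        hidden = out.last_hidden_state[:, -1]  # features for the policy head
```

With the cache, each step attends over all previously computed keys/values without re-running the transformer on the full prefix, which is safe precisely because the weights are frozen during inference.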

Hi, thanks for your answer. Another question in the TMaze Active domain. It seems that the ambiguous_position flag is chosen, which makes the domain more challenging. Is that version used in the experiment section of the paper?

Yes, we use this setting in both Passive and Active T-Mazes.

Another thing is that the reward function seems to be history-based, which is kind of non-standard compared to the usual R(s, a, s'). Also, the episode length is set to the episode length of the optimal policy, which I think might make the problem harder. Can you elaborate on these choices?

History-dependent reward

We define the memory length via the reward function R(h,a). We could re-define it in the "standard" form R(h,a,o') that incorporates the next observation. If we allowed more steps in the Passive T-Maze, it would have a Markovian reward that depends only on (o,a,o'); however, the transition function P(o' | h,a) would then have a memory length equal to the horizon. Thus, the memory length in the reward and in the transition can be exchanged to some extent in this task, depending on the problem formulation.
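To make the R(h,a) view concrete, here is an illustrative sketch of a history-dependent reward for a Passive-T-Maze-like task (my own simplification with a made-up corridor length, not the environment code in this repo):

```python
# Illustrative sketch of a history-dependent reward R(h, a) in a
# Passive-T-Maze-like task; not the environment code in this repo.
CORRIDOR_LENGTH = 10  # hypothetical corridor length

def history_dependent_reward(history, action):
    """history: observations o_0, ..., o_t seen so far.
    o_0 is the clue (+1 = upper arm rewarded, -1 = lower arm rewarded);
    corridor observations in between carry no goal information."""
    clue = history[0]                       # only the first observation matters
    at_junction = len(history) == CORRIDOR_LENGTH + 1
    if not at_junction:
        return 0.0                          # no reward along the corridor
    # the final reward depends on o_0, so R cannot be written as R(o, a)
    return 1.0 if action == clue else 0.0
```

The final reward depends on the very first observation through the whole history, which is what gives the task a memory length on the order of the horizon under the R(h,a) formulation.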

Is history-dependent reward non-standard?

If we consider the reward function in the form R(h,a), then many memory-based tasks have history-dependent rewards (e.g., POPGym; see the task analysis in our appendix).

Will more steps make the problem easier?

Maybe, but I don't think this matters much, as it also increases the number of exploration steps.

How do you run recurrent SAC for continuous actions in this repo? I couldn't find instructions for it, and the code to run it seems to be missing.

@hai-h-nguyen I'm working on open-sourcing the code on continuous control experiments, which is written in JAX (although it should be straightforward to write it in PyTorch). I will push the code to another branch as it was developed in a separate repo.

Thanks, that should be good enough. Out of curiosity, why didn't you reuse the PyTorch code from your pomdp-baselines repo instead of implementing it all in JAX?

Sorry for the delay. I have uploaded the code and the logged data in a new branch.

I wanted to switch to the JAX implementation for all the experiments, including discrete and continuous control, since GPT ran much faster in JAX than in PyTorch for long sequences (though not for short ones). However, I found I could not reproduce the results between the JAX and PyTorch code in discrete control, so I had to revert to the PyTorch version for discrete control. It was a twisted story..