Confusion over shape of returns_to_go in get_batch
DaveyBiggers commented
Hi, I'm trying to understand the following code in gym/experiment.py/get_batch():
rtg.append(discount_cumsum(traj['rewards'][si:], gamma=1.)[:s[-1].shape[1] + 1].reshape(1, -1, 1))
if rtg[-1].shape[1] <= s[-1].shape[1]:
    rtg[-1] = np.concatenate([rtg[-1], np.zeros((1, 1, 1))], axis=1)
...
tlen = s[-1].shape[1]
As far as I can tell, it creates a sequence of (tlen + 1) rtg values, then checks whether the sequence length is <= tlen, and pads it with an extra zero if so. (I'm struggling to see how this situation will ever arise.)
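A minimal sketch of this step on a toy trajectory may help pin down the shapes. It assumes `discount_cumsum` is a reverse cumulative sum and that `s[-1]` holds the states sliced as `[si:si + max_len]`; `rtg_for_window` is a hypothetical helper written for illustration, not something from the repository.

```python
import numpy as np

def discount_cumsum(x, gamma):
    # Assumed behaviour: out[t] = x[t] + gamma * out[t + 1]
    out = np.zeros_like(x, dtype=np.float64)
    out[-1] = x[-1]
    for t in reversed(range(len(x) - 1)):
        out[t] = x[t] + gamma * out[t + 1]
    return out

def rtg_for_window(rewards, si, max_len):
    # Hypothetical helper mirroring the two quoted lines for one trajectory.
    tlen = len(rewards[si:si + max_len])  # stands in for s[-1].shape[1]
    rtg = discount_cumsum(rewards[si:], gamma=1.0)[:tlen + 1].reshape(1, -1, 1)
    padded = rtg.shape[1] <= tlen
    if padded:
        # Append one trailing zero so rtg has tlen + 1 entries.
        rtg = np.concatenate([rtg, np.zeros((1, 1, 1))], axis=1)
    return tlen, rtg, padded

rewards = np.ones(10)  # toy trajectory with 10 steps of reward 1

# Window in the middle of the trajectory: the slice already yields tlen + 1 values.
mid = rtg_for_window(rewards, si=2, max_len=4)
# Window that runs into the end of the trajectory: only tlen values are available.
end = rtg_for_window(rewards, si=8, max_len=4)

print(mid[0], mid[1].shape, mid[2])
print(end[0], end[1].shape, end[2])
```

In this sketch the `<=` branch fires only when the window reaches the end of the trajectory, where `rewards[si:]` has no step beyond the window to supply the extra value.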
A few lines later, the padding code is applied, pre-padding with 0s to make sure everything is length max_len, except for rtg, which will now be length max_len + 1.
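The effect of the pre-padding on the shapes can be sketched as follows, assuming each array is left-padded with `max_len - tlen` zero steps along the time axis. `left_pad` and the sizes used here are hypothetical and only illustrate the description above.

```python
import numpy as np

def left_pad(x, n_pad):
    # Hypothetical helper: prepend n_pad zero steps along the time axis (axis=1).
    pad = np.zeros((x.shape[0], n_pad, x.shape[2]))
    return np.concatenate([pad, x], axis=1)

max_len, tlen, state_dim = 5, 3, 2   # illustrative sizes, not from the repository
s = np.ones((1, tlen, state_dim))    # states for one window
rtg = np.ones((1, tlen + 1, 1))      # returns-to-go carry one extra step

s_padded = left_pad(s, max_len - tlen)
rtg_padded = left_pad(rtg, max_len - tlen)

print(s_padded.shape)    # time axis has max_len entries
print(rtg_padded.shape)  # time axis has max_len + 1 entries
```

Because both arrays receive the same number of padding steps, the one-step difference between rtg and the states survives the padding.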
I don't understand the purpose of this extra value, especially since it seems to get stripped anyway by the SequenceTrainer:
state_preds, action_preds, reward_preds = self.model.forward(
    states, actions, rewards, rtg[:,:-1], timesteps, attention_mask=attention_mask,
)
Am I missing something?
Thanks!