asappresearch/emergent-comms-negotiation

reward modelling question

Closed this issue · 4 comments

In the prosocial case, the reward printed should be rewards[0][2] and not the one listed here.

In fact, what does taking the mean of all 3 rewards mean? For the selfish case, I believe there should be 2 rewards: one for each agent at the end of the episode.

Yes, rewards[0] gives the rewards for the first episode in the batch, but it has 3 entries: the first 2 denote the selfish reward for each agent, and the third denotes the prosocial reward. So I'm saying that taking the mean over all of them doesn't mean anything. It should print either both selfish rewards or only the prosocial one.
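To make the suggestion concrete, here is a minimal sketch of the printing logic I have in mind. It assumes the 3-entry layout described above; `describe_reward` and the sample values are hypothetical and are not taken from the repository's code.

```python
def describe_reward(episode_rewards, prosocial):
    """Format the reward for one episode.

    Assumes the layout described in this thread:
    [agent 0 selfish, agent 1 selfish, prosocial].
    """
    agent0, agent1, joint = episode_rewards
    if prosocial:
        # Prosocial case: only the third entry is meaningful.
        return "prosocial reward %.3f" % joint
    # Selfish case: report each agent's reward separately, no averaging.
    return "selfish rewards %.3f (agent 0), %.3f (agent 1)" % (agent0, agent1)


# Made-up values standing in for rewards[0], the first episode in a batch.
first_episode = [0.5, 0.25, 0.75]

print(describe_reward(first_episode, prosocial=True))
print(describe_reward(first_episode, prosocial=False))
```

Either branch avoids averaging quantities that measure different things, which is the concern with the mean over all 3 entries.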

That's good to know. Thanks!