DeNA/HandyRL

How to make the model consider immediate rewards?

glitter2626 opened this issue · 4 comments

I want to use the immediate reward from the environment to train my RL model. As described in the documentation, I implemented the "reward" function in the "Environment" class.
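Roughly, my implementation looks like this (a simplified sketch, not my full environment; `MyCoopEnv` and `last_step_score` are placeholder names, and I assumed the base class is `handyrl.environment.BaseEnvironment` with `reward()` returning a per-player dict of immediate rewards):

```python
from handyrl.environment import BaseEnvironment  # base class path is my assumption

class MyCoopEnv(BaseEnvironment):
    # reset(), play(), terminal(), outcome(), observation(), ... omitted for brevity

    def players(self):
        return [0, 1]

    def reward(self):
        # Immediate reward of the current step for each player.
        # `last_step_score` is a placeholder attribute updated inside play().
        return {p: self.last_step_score.get(p, 0.0) for p in self.players()}
```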

However, when I checked the loss calculation flow in train.py, losses['v'] seems to consider only the value output by the model and the outcome from the environment. I also found that losses['r'] takes the rewards from the environment into account.

Does this mean that my model also needs to output a "return" value?

@glitter2626
Thanks for your great question!
Yes, in the main code the "return" output predicts the cumulative sum of the immediate rewards.
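For example, a model can add a prediction head for it, roughly like this (a minimal sketch; please check train.py and handyrl/model.py for the exact output keys and the model interface, which this sketch only assumes to be 'policy', 'value', and 'return'):

```python
import torch
import torch.nn as nn

class ReturnPredictingModel(nn.Module):
    # Sketch: predicts the return (cumulative immediate reward) alongside policy and value.
    def __init__(self, input_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.policy_head = nn.Linear(hidden_dim, num_actions)
        self.value_head = nn.Linear(hidden_dim, 1)   # trained toward the outcome
        self.return_head = nn.Linear(hidden_dim, 1)  # trained toward the cumulative reward

    def forward(self, x, hidden=None):
        h = self.body(x)
        return {
            'policy': self.policy_head(h),
            'value': torch.tanh(self.value_head(h)),
            'return': self.return_head(h),
        }
```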

In the branch of #225, we removed the outcome. There, the value predicts the cumulative sum of the immediate rewards (possibly multidimensional). You can also set gamma as a list to use a different gamma for each reward dimension.
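Conceptually, the target for each dimension is just its own discounted cumulative sum, e.g. (a small standalone sketch, independent of the actual implementation in the branch):

```python
import numpy as np

def discounted_returns(rewards, gammas):
    """rewards: array of shape (T, D) with D reward dimensions per step.
    gammas: list of D discount factors, one per reward dimension.
    Returns an array of shape (T, D): the value target at each step."""
    rewards = np.asarray(rewards, dtype=np.float32)
    gammas = np.asarray(gammas, dtype=np.float32)
    returns = np.zeros_like(rewards)
    g = np.zeros(rewards.shape[1], dtype=np.float32)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gammas * g   # each dimension uses its own gamma
        returns[t] = g
    return returns

# e.g. a 2-dimensional reward with gamma 1.0 for the game result and 0.9 for a shaping reward
# discounted_returns([[0, 0.1], [0, 0.2], [1, 0.0]], gammas=[1.0, 0.9])
```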

@YuriCat Thanks for your clear reply.

I also have another question, about solo training. In self-play, the model always plays against itself. However, when we choose a training batch, we only randomly select one agent's episode.

Is there a reason why the episodes of all agents are not considered in the same batch, e.g. training stability? Can I simply comment that part out to consider the episodes of all agents in the same batch?

[update]
After I commented out the solo-training part, my training loss became very unstable :(

@glitter2626 Could you please try the fix in #276?

Of course, we can train all the players at the same time, but we had not carefully checked such cases. Thank you for pointing this out!
Note that the input might be biased if the observations contain the same data.

@YuriCat Thanks again for your fast solution!

I asked this question because I want to use this awesome framework to tackle a cooperative multi-agent problem. I revised some loss calculation steps as described in the VDN paper, but my current solution seems to have some bugs. I will try to fix it.
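For context, the core change I am attempting follows the VDN idea of decomposing the joint value into a sum of per-agent values before computing the loss. A simplified standalone sketch of that idea (placeholder names, not my actual patch to train.py):

```python
import torch
import torch.nn.functional as F

def vdn_value_loss(per_agent_values, team_return):
    """per_agent_values: tensor (batch, num_agents) of each agent's predicted value.
    team_return: tensor (batch,) of the shared cumulative-reward target.
    VDN: the joint value is the sum of the individual agents' values."""
    joint_value = per_agent_values.sum(dim=1)     # V_tot = sum_i V_i
    return F.mse_loss(joint_value, team_return)   # regress V_tot toward the team return

# usage sketch
values = torch.randn(32, 3, requires_grad=True)   # 3 cooperating agents
target = torch.randn(32)
loss = vdn_value_loss(values, target)
loss.backward()
```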

All in all, I just wanted to thank you for your contributions. I learned a lot from you.