Some questions after reading the paper and code
Opened this issue · 0 comments
CodingNovice7 commented
After reading your paper and open-source code, I have three questions. If it's convenient, I hope you can help me answer them.
- It seems that the reward in the dataset is only used to generate the return-to-go during `get_batch`. Although reward is an input during training and evaluation, I cannot see what role it plays inside the network. Is the reward's only function to generate the return-to-go?
- When setting up the environment, you need to specify a `target_return`, but I don't understand its function. It seems that even when it is larger than the largest return-to-go in the existing dataset, the experiment can still succeed. In other words, I would like to know what impact `target_return` has on the network.
- During evaluation, each `target_return` is evaluated for 100 episodes. As I understand it, the results should improve over the episodes, i.e., the reward in the evaluation stage should keep getting better. However, the results I obtained are not like this. What is the reason?
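For context on my first question, here is a minimal sketch of how I understand the return-to-go computation from rewards (the function name, `gamma` argument, and NumPy usage are my own for illustration; the released code may be organized differently):

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Compute rtg[t] = sum_{t' >= t} gamma^(t' - t) * rewards[t']
    by scanning the reward sequence backwards."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example trajectory of per-step rewards:
print(returns_to_go(np.array([1.0, 0.0, 2.0, 1.0])))  # -> [4. 3. 3. 1.]
```

My understanding is that at evaluation time the model is conditioned on `target_return` in place of these dataset-derived returns-to-go, with the conditioning value reduced by each observed reward after every environment step; please correct me if that reading of the code is wrong.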
I hope you can resolve my doubts at your convenience. Thank you very much! Good luck!