Questions about the node_acts & job_acts selection.
jiahy0825 opened this issue · 3 comments
[Embedded code snippet: lines 73 to 80 at commit c010dd7]
In my opinion, self.node_act_probs in your code represents the probability of selecting each node, and then you use noise to explore the action space of the reinforcement learning problem. However, after writing out what the code computes, I get

node_acts = argmax_i (log(p_i) - log(-log(u_i))), where u_i ~ Uniform(0, 1) is the noise and p_i is the i-th entry of self.node_act_probs.

Maybe my understanding is wrong, but how should I understand this implementation?
BTW, have you considered using the epsilon-greedy strategy to explore the action space instead?
Hi, happy birthday! Thank you for carefully examining our code. This part of the action sampling is indeed tricky to understand. We learned it from OpenAI's implementation. It is a mathematical trick called the Gumbel-max trick. A detailed explanation can be found at http://amid.fish/humble-gumbel, https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/ and openai/baselines#358.
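For intuition, here is a minimal NumPy sketch (illustrative only, not the repository code) that checks the trick empirically: perturbing log(p_i) with Gumbel noise -log(-log(u_i)) and taking the argmax draws index i with probability p_i.

```python
# Minimal sketch of the Gumbel-max trick (not the Decima code itself):
# argmax_i (log p_i - log(-log u_i)) with u_i ~ Uniform(0, 1)
# selects index i with probability p_i.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])  # example action probabilities (assumed)

def gumbel_max_sample(p, rng):
    u = rng.uniform(size=p.shape)                       # u_i ~ Uniform(0, 1)
    return np.argmax(np.log(p) - np.log(-np.log(u)))    # noise-perturbed argmax

counts = np.bincount(
    [gumbel_max_sample(p, rng) for _ in range(100000)], minlength=len(p))
print(counts / counts.sum())  # empirical frequencies should be close to p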
About epsilon-greedy: note that the vanilla policy-gradient family (e.g., A2C) is on-policy (it has to use data sampled from the current policy). Epsilon-greedy creates a bias in the data, because sometimes the action is sampled at random rather than from the current policy. You would then need a correction, such as importance sampling, to make the training data unbiased for policy gradient. To avoid all this complication, the standard way to explore with policy gradient is to increase the entropy of the action distribution and let the random sampling naturally explore (see the sketch below). Hope these help!
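To make that concrete, here is a schematic NumPy sketch (the function name and the entropy weight are illustrative, not taken from the repository) of a policy-gradient loss with an entropy bonus in place of epsilon-greedy exploration:

```python
# Schematic entropy-regularized policy-gradient loss (illustrative names).
import numpy as np

def pg_loss_with_entropy(act_probs, chosen_idx, advantage, entropy_weight=0.01):
    """act_probs: 1-D array of action probabilities from the current policy."""
    log_p_chosen = np.log(act_probs[chosen_idx])
    entropy = -np.sum(act_probs * np.log(act_probs + 1e-8))
    # Push up the log-probability of actions with positive advantage,
    # minus an entropy bonus that keeps the distribution from collapsing.
    return -(log_p_chosen * advantage) - entropy_weight * entropy

print(pg_loss_with_entropy(np.array([0.1, 0.2, 0.3, 0.4]), 3, advantage=1.5))
```

Because the actions are still drawn from the current policy, the data stays on-policy and no importance-sampling correction is needed.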
BTW, I think TensorFlow now supports something like tf.distributions.Categorical(...).sample(). This should be simple to use and robust enough for this application. When we developed Decima, this option did not exist (I think).
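For reference, a minimal sketch assuming TensorFlow 1.x, where tf.distributions.Categorical provides this kind of sampling directly (again, this is not how the released code samples actions):

```python
# Sketch assuming TensorFlow 1.x; samples one node index per batch row.
import tensorflow as tf

node_act_probs = tf.constant([[0.1, 0.2, 0.3, 0.4]])       # [batch, num_nodes]
dist = tf.distributions.Categorical(probs=node_act_probs)  # one distribution per row
node_acts = dist.sample()                                   # shape [batch]

with tf.Session() as sess:
    print(sess.run(node_acts))
```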
Thank you for your quick reply, which resolved my doubts very well!