Questions about the node_acts & job_acts selection.
jiahy0825 opened this issue · 3 comments
[Embedded code snippet: lines 73 to 80 at commit c010dd7]
In my opinion, self.node_act_probs in your code represents the probability of selecting each node, and then you use noise to explore the action space of the reinforcement learning problem. However, after writing out what the code computes, I get

node_acts = argmax_i (log(p_i) - log(-log(u_i))), where u_i ~ Uniform(0, 1) is the noise and p_i is the i-th entry of self.node_act_probs.

Maybe my understanding is wrong, but how should I understand this implementation?
BTW, have you considered using the epsilon-greedy strategy to explore the action space instead?
Hi, happy birthday! Thank you for carefully examining our code. This part of the action sampling is indeed tricky to understand. We learned it from OpenAI's implementation. It is a mathematical trick called the Gumbel-max trick. A detailed explanation can be found at http://amid.fish/humble-gumbel, https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/ and openai/baselines#358.
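For intuition, here is a minimal NumPy sketch (illustrative only, not the repository code) that checks the trick empirically: perturbing log(p_i) with Gumbel noise -log(-log(u_i)) and taking the argmax draws index i with probability p_i.

```python
# Minimal sketch of the Gumbel-max trick (not the Decima code itself):
# argmax_i (log p_i - log(-log u_i)) with u_i ~ Uniform(0, 1)
# selects index i with probability p_i.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])  # example action probabilities (assumed)

def gumbel_max_sample(p, rng):
    u = rng.uniform(size=p.shape)                       # u_i ~ Uniform(0, 1)
    return np.argmax(np.log(p) - np.log(-np.log(u)))    # noise-perturbed argmax

counts = np.bincount(
    [gumbel_max_sample(p, rng) for _ in range(100000)], minlength=len(p))
print(counts / counts.sum())  # empirical frequencies should be close to p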
About epsilon-greedy: note that the vanilla policy-gradient family (e.g., A2C) is on-policy (it has to use data sampled from the current policy). Epsilon-greedy creates a bias in the data, because sometimes the action is sampled at random rather than from the current policy. You would then need a correction, such as importance sampling, to make the training data unbiased for policy gradient. To avoid all this complication, the standard way to explore with policy gradient is to increase the entropy of the action distribution and let the random sampling naturally explore (see the sketch below). Hope these help!
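To make that concrete, here is a schematic NumPy sketch (the function name and the entropy weight are illustrative, not taken from the repository) of a policy-gradient loss with an entropy bonus in place of epsilon-greedy exploration:

```python
# Schematic entropy-regularized policy-gradient loss (illustrative names).
import numpy as np

def pg_loss_with_entropy(act_probs, chosen_idx, advantage, entropy_weight=0.01):
    """act_probs: 1-D array of action probabilities from the current policy."""
    log_p_chosen = np.log(act_probs[chosen_idx])
    entropy = -np.sum(act_probs * np.log(act_probs + 1e-8))
    # Push up the log-probability of actions with positive advantage,
    # minus an entropy bonus that keeps the distribution from collapsing.
    return -(log_p_chosen * advantage) - entropy_weight * entropy

print(pg_loss_with_entropy(np.array([0.1, 0.2, 0.3, 0.4]), 3, advantage=1.5))
```

Because the actions are still drawn from the current policy, the data stays on-policy and no importance-sampling correction is needed.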
BTW, I think TensorFlow now supports something like tf.distributions.Categorical(...).sample(). This should be simple to use and robust enough for this application. When we developed Decima, this option did not exist (I think).
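For reference, a minimal sketch assuming TensorFlow 1.x, where tf.distributions.Categorical provides this kind of sampling directly (again, this is not how the released code samples actions):

```python
# Sketch assuming TensorFlow 1.x; samples one node index per batch row.
import tensorflow as tf

node_act_probs = tf.constant([[0.1, 0.2, 0.3, 0.4]])       # [batch, num_nodes]
dist = tf.distributions.Categorical(probs=node_act_probs)  # one distribution per row
node_acts = dist.sample()                                   # shape [batch]

with tf.Session() as sess:
    print(sess.run(node_acts))
```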
Thank you for your quick reply, which resolved my doubts very well!