[Question] Basic question on how to interpret the reward of tictactoe_v3.env
gotexis opened this issue · 2 comments
gotexis commented
Question
Maybe I am fundamentally wrong about how to interpret the reward, but why doesn't the code below work?
What I am looking for is simple: I would like to check for the winner.
import numpy as np
from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()

def random_policy(mask):
    valid_actions = np.where(mask)[0]
    print("valid:", valid_actions)
    return np.random.choice(valid_actions)

n_eval_episodes = 100
stats = {
    "player_1": 0,
    "player_2": 0,
    "draw": 0
}
for episode in range(n_eval_episodes):
    env.reset()
    done = False
    while not done:
        for agent in env.agent_iter():
            observation, reward, done, _, _ = env.last()  # Simplifying your unpacking here
            if done:
                if reward == 1:  # Assuming reward of 1 indicates a win for player_1
                    stats["player_1"] += 1
                elif reward == -1:  # Assuming reward of -1 indicates a win for player_2
                    stats["player_2"] += 1
                else:  # Assuming a draw otherwise
                    stats["draw"] += 1
                break  # Exit the inner loop if the game is done
            else:
                mask = observation["action_mask"]
                action = random_policy(mask)  # Assuming random_policy is defined somewhere
                env.step(action)
print(f"Result: {stats}")  # returns -> {'player_1': 0, 'player_2': 90, 'draw': 10}
gotexis commented
wins = {
    "player_1": 0,
    "player_2": 0,
}
for episode in range(n_eval_episodes):
    env.reset()
    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        if termination or truncation:
            action = None
            if reward == 1:
                wins[agent] += 1
            elif reward == 0:
                pass
            elif reward == -1:
                pass
        else:
            mask = observation["action_mask"]
            # this is where you would insert your policy
            action = env.action_space(agent).sample(mask)
        env.step(action)
env.close()
print(f"Result wins: {wins}, draws: {n_eval_episodes - sum(wins.values())}")
Seems like this works.
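As far as I can tell, the difference is that `env.last()` returns the reward of the agent currently being iterated, so a reward of -1 means "this agent lost", not "player_2 won". The first snippet stops at the first terminated agent it sees, which after a win appears to be the player who did not make the final move, so every decided game gets counted under the -1 branch. Here is a minimal sketch of the interpretation, using a hypothetical helper `winner_from_rewards` and hand-written reward dicts in place of a real environment (both are my own illustration, not part of PettingZoo):

```python
from typing import Mapping, Optional


def winner_from_rewards(final_rewards: Mapping[str, float]) -> Optional[str]:
    """Return the agent with a positive final reward, or None for a draw.

    `final_rewards` is assumed to map each agent name to the reward that
    env.last() reported for that agent once the game had terminated.
    """
    winners = [agent for agent, reward in final_rewards.items() if reward > 0]
    return winners[0] if winners else None


# Hand-written examples standing in for per-agent rewards at the end of a game.
print(winner_from_rewards({"player_1": 1, "player_2": -1}))
print(winner_from_rewards({"player_1": -1, "player_2": 1}))
print(winner_from_rewards({"player_1": 0, "player_2": 0}))
```

The point is that the reward has to be read together with the agent it belongs to, which is what `wins[agent] += 1` does in the loop above.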
elliottower commented
Looks like you got it working. Feel free to reopen if you have any more questions.