[Question] Basic question on how to interpret the reward of tictactoe_v3.env

gotexis opened this issue a year ago · 2 comments

Question

Maybe I am fundamentally wrong about how to interpret the reward, but why doesn't the code below work?

What I am looking for is simple: I would like to check for the winner.

import numpy as np
from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()

def random_policy(mask):
    valid_actions = np.where(mask)[0]
    print("valid:", valid_actions)
    return np.random.choice(valid_actions)


n_eval_episodes = 100
stats = {
    "player_1": 0,
    "player_2": 0,
    "draw": 0
}

for episode in range(n_eval_episodes):
    env.reset()
    done = False
    while not done:
        for agent in env.agent_iter():
            observation, reward, done, _, _ = env.last()  # Simplifying your unpacking here
            
            if done:
                if reward == 1:  # Assuming reward of 1 indicates a win for player_1
                    stats["player_1"] += 1
                elif reward == -1:  # Assuming reward of -1 indicates a win for player_2
                    stats["player_2"] += 1
                else:  # Assuming a draw otherwise
                    stats["draw"] += 1
                break  # Exit the inner loop if the game is done
            
            else:
                mask = observation["action_mask"]
                action = random_policy(mask)  # Assuming random_policy is defined somewhere
            env.step(action)

print(f"Result: {stats}")  # returns ->  {'player_1': 0, 'player_2': 90, 'draw': 10}

Answer 1 · 2023-10-09T09:02:24.000Z

wins = {
    "player_1": 0,
    "player_2": 0,
}

for episode in range(n_eval_episodes):
    env.reset()

    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()

        if termination or truncation:
            action = None
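            # `reward` belongs to `agent`, the agent env.last() is reporting on
            if reward == 1:
                wins[agent] += 1
            # a reward of 0 (draw) or -1 (loss) needs no tally; draws are derived at the end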
        else:
            mask = observation["action_mask"]
            # this is where you would insert your policy
            action = env.action_space(agent).sample(mask)

        env.step(action)
env.close()  # close once, after all episodes have run

print(f"Result wins: {wins}, draws: {n_eval_episodes - sum(wins.values())}")

Seems like this works.
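
As far as I can tell, the original loop miscounted because `env.last()` returns the reward of the agent whose turn it currently is, not the reward of player_1. After a winning move, the next agent to be iterated is the loser, so the first `done` always arrives with a reward of -1 (or 0 for a draw) and every decisive game was booked as a player_2 win. A minimal sketch of reading the winner from a per-agent reward dict instead is below. `winner_from_rewards` is a hypothetical helper, the episode data is made up for illustration, and the +1 / -1 / 0 convention is the one assumed in this thread.

```python
from collections import Counter
from typing import Mapping, Optional


def winner_from_rewards(rewards: Mapping[str, float]) -> Optional[str]:
    """Return the agent whose final reward is positive, or None for a draw.

    Assumes the convention used in this thread: +1 for the winner,
    -1 for the loser, and 0 for both agents on a draw.
    """
    for agent, reward in rewards.items():
        if reward > 0:
            return agent
    return None


# Made-up final rewards for three episodes (illustration only).
# With PettingZoo, a dict of this shape should be available as
# `env.rewards` once the game has ended.
episodes = [
    {"player_1": 1, "player_2": -1},
    {"player_1": -1, "player_2": 1},
    {"player_1": 0, "player_2": 0},
]

stats = Counter()
for final_rewards in episodes:
    # Fall back to the "draw" bucket when nobody has a positive reward.
    stats[winner_from_rewards(final_rewards) or "draw"] += 1

print(dict(stats))  # {'player_1': 1, 'player_2': 1, 'draw': 1}
```

Keying the tally on the agent name, as `wins[agent] += 1` does above, avoids having to guess which player a bare +1 or -1 refers to.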

Answer 2 · 2023-10-11T19:08:36.000Z

Looks like you got it working. Feel free to reopen if you have any more questions.