Eclectic-Sheep/sheeprl

enabling self play

Opened this issue · 4 comments

Hi,
I tried out this project and it is one of the few that actually works off the shelf, thank you for your work.
Is there a way to enable self-play when training an agent? My use case is to use DreamerV3 as an alternative to algorithms such as MuZero for training agents for board games.
I have looked around the repo, but this feature does not seem to be available out of the box.

Hi @drblallo, thank you for your kind words!
You're right, self-play is not supported right now. Do you have any specific references on self-play that we could look at?

From what I gather, pretty much everyone implements it the same way: the environment has a function that tells you who the current player is, and the rewards are a vector with one element per player. For example, OpenSpiel from Google DeepMind implements it like this: https://github.com/google-deepmind/open_spiel/blob/master/open_spiel/python/examples/tic_tac_toe_qlearner.py#L118

player_id = time_step.observations["current_player"]  # get the current player
agent_output = agents[player_id].step(time_step)  # ask the agent assigned to that player which action to take
time_step = env.step([agent_output.action])  # perform the action

As far as I know there is no established way to do anything fancier than this, apart from things like minimax search, but those are AlphaGo-style techniques that do not make much sense for algorithms like Dreamer. So the whole thing should just require keeping an array of agents instead of a single one.
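
For concreteness, a minimal version of that loop could look like the sketch below. It uses OpenSpiel's bundled random agents purely as stand-ins for trained players; the turn-taking structure is the only point here.

from open_spiel.python import rl_environment
from open_spiel.python.algorithms import random_agent

env = rl_environment.Environment("tic_tac_toe")
num_actions = env.action_spec()["num_actions"]
agents = [
    random_agent.RandomAgent(player_id=p, num_actions=num_actions)
    for p in range(env.num_players)
]

for _ in range(10):  # a handful of self-play episodes
    time_step = env.reset()
    while not time_step.last():
        player_id = time_step.observations["current_player"]  # whose turn it is
        agent_output = agents[player_id].step(time_step)
        time_step = env.step([agent_output.action])
    # the terminal time step carries the reward vector, one entry per player
    for agent in agents:
        agent.step(time_step)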

In principle I am willing to implement this myself, if it is expected to be a reasonably contained effort.

Hi @drblallo

I need this right now too. Before I start working on it and adapting what's done in cli.py::eval_algorithm to my wrapper, I just wanted to ping you to see whether you have already done that work.

Basically, the interface I'm looking for is something like this:

agent = load(checkpoint_path, config, seed)
action = agent.act(obs_space.sample())
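
Something along the lines of this thin wrapper is what I have in mind (CheckpointAgent is a made-up name; rebuilding the policy from a sheeprl checkpoint would reuse whatever cli.py::eval_algorithm already does and is not shown here):

import torch

class CheckpointAgent:
    def __init__(self, policy: torch.nn.Module, device: str = "cpu"):
        self.policy = policy.to(device).eval()
        self.device = device

    @torch.no_grad()
    def act(self, obs):
        # obs is a single observation (e.g. obs_space.sample()); add a batch
        # dimension, query the policy, and return a plain numpy action
        obs_t = torch.as_tensor(obs, dtype=torch.float32, device=self.device).unsqueeze(0)
        action = self.policy(obs_t)
        return action.squeeze(0).cpu().numpy()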

Thanks

Hi @drblallo and @geranim0! What you could do to enable self-play is the following:

  • Create a new agent (inheriting from an already defined one, like DreamerV3 for example) in a new folder and adapt it so that it instantiates as many agents as there are players, i.e. call build_agent N times and save those agents in a dict or list
  • Create a wrapper for OpenSpiel so that the environment can be used directly in sheeprl (a rough sketch follows this list)
  • Interact with the environment as specified by OpenSpiel
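
A rough sketch of such a wrapper, only to show the shape of the thing: the class name and the spaces below are illustrative, chance nodes and legal-action masking are ignored, and the algorithm side would still need to route actions and rewards per player via the info dict.

import gymnasium as gym
import numpy as np
import pyspiel

class SelfPlayOpenSpielEnv(gym.Env):
    def __init__(self, game_name: str = "tic_tac_toe"):
        self.game = pyspiel.load_game(game_name)
        obs_size = self.game.observation_tensor_size()
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(obs_size,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(self.game.num_distinct_actions())
        self.state = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.game.new_initial_state()
        return self._obs(), {"current_player": self.state.current_player()}

    def step(self, action):
        self.state.apply_action(int(action))
        terminated = self.state.is_terminal()
        # returns() is the reward vector with one entry per player, as discussed above
        reward = np.asarray(self.state.returns(), dtype=np.float32) if terminated else np.zeros(self.game.num_players(), dtype=np.float32)
        info = {"current_player": self.state.current_player()}
        return self._obs(), reward, terminated, False, info

    def _obs(self):
        player = max(self.state.current_player(), 0)  # at terminal states fall back to player 0
        return np.asarray(self.state.observation_tensor(player), dtype=np.float32)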

This works as long as the observations, actions, rewards, and everything else saved in the rollout or replay buffer have shape [sequence_length, num_envs, ...]. Also bear in mind that in every algorithm we add a leading 1 to everything we save in the replay buffer, because we assume that the vectorized environment returns data of shape [num_envs, ...]. One thing you could do is to save arrays of shape [seq_len, num_envs, num_players, ...] into the buffer, or create an independent replay buffer for every player and sample from them accordingly (a toy sketch of the latter is shown below).
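
A toy sketch of the per-player option, where SimpleBuffer is only a stand-in for the actual ReplayBuffer used by the algorithm:

import numpy as np

num_envs, num_players, obs_dim = 1, 2, 9

class SimpleBuffer:
    """Stand-in buffer: stores per-step dicts whose values have shape [1, num_envs, ...]."""
    def __init__(self):
        self.steps = []

    def add(self, step_data: dict):
        self.steps.append(step_data)

# one buffer per player, so each one can be sampled independently later
buffers = {p: SimpleBuffer() for p in range(num_players)}

current_player = 0
step_data = {
    "observations": np.zeros((1, num_envs, obs_dim), dtype=np.float32),
    "actions": np.zeros((1, num_envs, 1), dtype=np.float32),
    "rewards": np.zeros((1, num_envs, 1), dtype=np.float32),
}
buffers[current_player].add(step_data)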

This could be linked to #278