google-deepmind/open_spiel

Spades Implementation

i-Madsen opened this issue · 16 comments

Hi there, I'm trying to make a new implementation of Spades and am looking for some help along the way as I'm pretty new to RL. Ultimately, I'm hoping to use the framework to train a Spades AI and use it in an iOS/Android app made in Unity (but that's getting a little ahead of myself).

Since the game of Bridge has a similar structure (partnerships, bidding phase, trick taking) and is already in OpenSpiel, I'm using it as a base, but some game concepts are pretty different - particularly how bidding works.

So the first thing I want to make sure I'm doing correctly is dealing with the tensor sizing/representation.

bridge.h starts out with:

inline constexpr int kNumObservationTypes = 4;  // Bid, lead, declare, defend
// Because bids always increase, any individual bid can be made at most once.
// Thus for each bid, we only need to track (a) who bid it (if anyone), (b) who
// doubled it (if anyone), and (c) who redoubled it (if anyone).
// We also report the number of passes before the first bid; we could
// equivalently report which player made the first call.
// This is much more compact than storing the auction call-by-call, which
// requires 318 turns * 38 possible calls per turn = 12084 bits (although
// in practice almost all auctions have fewer than 80 calls).
inline constexpr int kAuctionTensorSize =
    kNumPlayers * (1           // Did this player pass before the opening bid?
                   + kNumBids  // Did this player make each bid?
                   + kNumBids  // Did this player double each bid?
                   + kNumBids  // Did this player redouble each bid?
                   ) +
    kNumCards                                  // Our hand
    + kNumVulnerabilities * kNumPartnerships;  // Vulnerability of each side
inline constexpr int kPublicInfoTensorSize =
    kAuctionTensorSize  // The auction
    - kNumCards         // But not any player's cards
    + kNumPlayers;      // Plus trailing passes
inline constexpr int kMaxAuctionLength =
    kNumBids * (1 + kNumPlayers * 2) + kNumPlayers;

As far as I understand, Bridge bidding has multiple actions you can take and keeps going until all other players pass, culminating in a single contract. Spades, on the other hand, has each player make a single bid and then the bidding phase ends, leaving '4' contracts (partnerships typically work together to meet their combined contract, but a 'Nil' bid (0) is player dependent). Spades also has no concepts of being vulnerable or having a declarer.

So with that in mind, I'm thinking I change it like this:

inline constexpr int kNumObservationTypes = 1;
inline constexpr int kAuctionTensorSize = kNumPlayers * kNumBids
                                          + kNumCards;  // Our hand
inline constexpr int kPublicInfoTensorSize =
    kAuctionTensorSize  // The auction
    - kNumCards;         // But not any player's cards
inline constexpr int kMaxAuctionLength = 4;

Later on, there is a method for getting the play tensor size.

bridge.h

  static int GetPlayTensorSize(int num_tricks) {
    return kNumBidLevels          // What the contract is
           + kNumDenominations    // What trumps are
           + kNumOtherCalls       // Undoubled / doubled / redoubled
           + kNumPlayers          // Who declarer is
           + kNumVulnerabilities  // Vulnerability of the declaring side
           + kNumCards            // Our remaining cards
           + kNumCards            // Dummy's remaining cards
           + num_tricks * kNumPlayers * kNumCards  // Number of played tricks
           + kNumTricks   // Number of tricks we have won
           + kNumTricks;  // Number of tricks they have won
  }

Again, bids in spades are a simple. single bid, one per player, with no concept of modifying calls, a declarer, or vulnerability. Spades will always be trump and there is no 'dummy' hand, so I'm assuming I can just remove half of these and then multiply bids/tricks by kNumPlayers since we'll want to track each player individually.

spades.h

  static int GetPlayTensorSize(int num_tricks) {
    return kNumBids * kNumPlayers                   // What each player's contract is
           + kNumCards                              // Our remaining cards
           + num_tricks * kNumPlayers * kNumCards   // Number of played tricks
           + kNumTricks * kNumPlayers;              // Number of tricks each player has won
  }

Am I misunderstanding anything or does this seem correct?

Hi @i-Madsen

That sounds about right to me, but I will give a bit of advice: those observation tensor functions are the most difficult ones to implement. Leave them until the end. I suggest getting a working implementation of Spades with those functions left blank or returning bogus values first because that's already a nontrivial task. It's best to discuss those functions only after the basic game logic functionality is implemented.

Gotcha, that makes sense. Thanks, I'll stick with this for now and continue on with getting the game logic all working.

I've now got the game functionality able to build and run, completing the basic game tests successfully (I'm ignoring/disabling anything to do with double_double and the uncontested_bidding variant). I think I'm about ready to take a stab at implementing the tensor functions, but have a question about the overall approach for this game.

The Bridge implementation runs only a single round as a zero sum game, always starting with the same player. While this makes sense for Bridge, Spades is a little different as both teams can go either positive or negative during a round. Therefore, the overall game state is much more important as a lower-risk bid that results in that partnership winning the overall game is much better than a high-risk bid that results in a higher point difference. (Overall game state also determines if bag penalties are a risk during the round.)

Currently, I've kept the Bridge structure of always starting with player 0 and playing a single round and my thoughts are that I'll just feed in game parameters like this:

/*parameter_specification=*/
                         {
                             // Whether to end the game early if score gets too low
                             {"use_mercy_rule", GameParameter(true)},

                             // If using mercy rule, the threshold of negative points
                             {"mercy_threshold", GameParameter(-350)},

                             // Amount of points needed to win the game
                             {"win_threshold", GameParameter(500)},

                             // Parnership's current scores
                             // (can infer bags from last digit)
                             {"score_partnership_0", GameParameter(0)},
                             {"score_partnership_1", GameParameter(0)},

                             // If true, replace the play phase with a computed
                             // result based on perfect-information play.
                             {"use_double_dummy_result", GameParameter(false)},

                             // Number of played tricks in observation tensor
                             {"num_tricks", GameParameter(2)},
                         }};

The outside training script can track the scores and manage the teams by 'rotating' what positions the players/team scores are in. (Also need to add a big bonus for winning to the returned rewards in the game environment?)

@lanctot Does this seem reasonable? Or do I need to rethink the approach?

Hi @i-Madsen , yes, that sounds very reasonable to me! And using the Bridge functions as a reference is good. You can also check out Hearts, Skat, and Dou Di Zhu games as well for reference. Good luck!

Hey @lanctot, I've gotten my current version of the tensor methods passing the basic game test; are there any other tests I should be running on them right now?

Also, any advice on what learning algorithm(s) to try on Spades?

Great! Well, I think some Spades-specific tests would be nice, but I'd say turn it into a PR if you're willing to contribute. Would be great to make it into the v1.5 release coming in the next few weeks if possible (and can announce it in the release notes).

For starting algorithms, I'd say maybe self-play DQN just to get your hands dirty, then NFSP, and R-NaD. See also the other thread on dominoes (#1218). A lot of the advice and pointers would apply to your case.

Cool, I cleaned up the code and I think I've got it all in a pull request now.

Note that a few game parameters are unused at the moment, but I've left them in for now (they'd be needed if the management of the overall game state is moved into the Spades code).

Related question to that: if I want to feed in new game parameters during training, is there a built in way to do that or do I need to make a custom method for Spades? So far from the examples I've looked at in the repo, env.reset() will get called, but that just reuses the same parameters that the game was initialized with.

@lanctot I was able to follow the guide about implementing game-specific functionality and have been using that while experimenting with DQN (I'm now incrementing the dealer and making sure the overall score is set after resetting the environment). Had some positive signs from the reward outputs, so I wanted to try a test with NFSP, but it gave an error that information_state_string wasn't implemented.

So now I've added:
spades.h:
std::string InformationStateString(Player player) const override;
spades.cc:

std::string SpadesState::InformationStateString(Player player) const {
  return ObservationString(player);
}

as well as setting /*provides_information_state_string=*/true,

(I think information and observation states being the same in this case is valid?)

However, when I generate a playthrough, spades.txt always shows GameType.provides_information_state_string = False which causes tests to fail whether I generate pre- or post-build.

Any idea what may be causing this? Is there another place I need to set provides_information_state_string that I'm missing?

Hi @i-Madsen,

The information state string is only used by tabular methods like CFR and XFP. I think what you you need to add the InformationStateTensor function for use with neural networks.

Alternatively, we could add a flag to the NFSP implementation that specifies using the observation tensor instead of the information state tensor (the RL environment does this implicitly, but NFSP is a method that was designed for imperfect information games so it assumes the existence of an information state tensor. We should not make that the default for NFSP, but if the user were to specify it then we can handle it). If you go this route, I'd encourage you to submit a PR because that'd be generally useful. (Also anything you add to spades, I would encourage you also submit a PR for it).

BTW, please see the OpenSpiel paper for a difference in information state and observation tensors. The former normally includes all the history (all the public actions chosen from the start of the game).

Also, I suspect you'll be interested in this paper: https://arxiv.org/abs/2404.13150

@lanctot I'll probably try to add the flag to NFSP if that'd be helpful, but first I've still got this strange behavior going on - altering any of the GameType values in spades.cc isn't being recognized when generating a playthrough:

const GameType kGameType{/*short_name=*/"spades",
                         /*long_name=*/"Partnership Spades",
                         GameType::Dynamics::kSequential,
                         GameType::ChanceMode::kExplicitStochastic,
                         GameType::Information::kImperfectInformation,
                         GameType::Utility::kGeneralSum,
                         GameType::RewardModel::kTerminal,
                         /*max_num_players=*/kNumPlayers,
                         /*min_num_players=*/kNumPlayers,
                         /*provides_information_state_string=*/true,
                         /*provides_information_state_tensor=*/true,
                         /*provides_observation_string=*/true,
                         /*provides_observation_tensor=*/true,
                         /*parameter_specification=*/
                         {
                             // Whether to end the game early if score gets too low
                             {"use_mercy_rule", GameParameter(true)},
                             // If using mercy rule, the threshold of negative points
                             {"mercy_threshold", GameParameter(-350)},
                             // Amount of points needed to win the game
                             {"win_threshold", GameParameter(500)},
                             // The amount to add to reward return for winning
                             // (Will subtract for losing by mercy rule)
                             {"win_or_loss_bonus", GameParameter(200)},
                             // Number of played tricks in observation tensor
                             {"num_tricks", GameParameter(2)},
                         }};

Changing to kZeroSum, lowering min_num_players, flipping the provides information/observation flags, etc., results in no changes to the generated playthrough in spades.txt (whether I run generate_new_playthrough.sh spades by itself or after build_and_run_tests.sh --build_only=true).

Is there perhaps a cached file overriding the values from spades.cc?

As for the tensors, if my understanding is correct, I believe the way observation is setup in Spades should overlap the information state completely:

  static int GetPlayTensorSize(int num_tricks) {
    return kNumBids * kNumPlayers                   // What each player's contract is
           + kNumCards                              // Our remaining cards
           + num_tricks * kNumPlayers * kNumCards   // Number of played tricks
           + kNumTricks * kNumPlayers;              // Number of tricks each player has won
  }

The observation tensor already includes the cards played by every player in every trick as well as each player's bid, so that should encompass the entirety of public history. (In Bridge the whole auction process would need to be represented, but in Spades the entire auction is a single bid from each player which is represented by 'each player's contract' in the tensor.)

Also, thank you for sharing that paper! It looks like it could be very helpful.

This is separate from whatever is going on with the playthrough generation, but I'm also realizing the error about "InformationStateString is not implemented" is not coming from the NFSP implementation itself, but rather the training script.

Following leduc_nfsp.py, I have for evaluation:

if FLAGS.evaluation_metric == "exploitability":
          # Avg exploitability is implemented only for 2 players constant-sum
          # games, use nash_conv otherwise.
          expl = exploitability.exploitability(env.game, joint_avg_policy)
          logging.info("[%s] Exploitability AVG %s", ep + 1, expl)
        elif FLAGS.evaluation_metric == "nash_conv":
          nash_conv = exploitability.nash_conv(env.game, joint_avg_policy)
          logging.info("[%s] NashConv %s", ep + 1, nash_conv)
        else:
          raise ValueError(" ".join(("Invalid evaluation metric, choose from",
                                     "'exploitability', 'nash_conv'.")))

and from the line nash_conv = exploitability.nash_conv(env.game, joint_avg_policy) is where the error comes from.

  File ".../algorithms/exploitability.py", line 195, in nash_conv
    best_response_values = np.array([
  File ".../algorithms/exploitability.py", line 196, in <listcomp>
    pyspiel_best_response.BestResponsePolicy(
  File ".../algorithms/best_response.py", line 113, in __init__
    self.infosets = self.info_sets(root_state)
  File ".../algorithms/best_response.py", line 121, in info_sets
    infosets[s.information_state_string(self._player_id)].append((s, p))
pyspiel.SpielError: InformationStateString is not implemented.

Hi @i-Madsen

Sorry for the delay in response.

For the first message: that is strange, can you confirm that you are running make in the build directory and that the libopen_spiel.so is being updated as a results? (i.e. by comparing the date and time of the file before and after the change)? If you recompile the examples/example.cc, you can also load the game and inspect the game type by calling game->GameType() and inspect it manually. Is it different? Possibly python is picking up your lib_openspiel.so from a different place (like if you installed it via pip or have the PYTHONPATH not set properly)

The code you have listed looks like an observation tensor. An information state tensor usually involves the entire history of the moves. Check out the ones for Kuhn and Leduc poker, for instance. They encode the history of bets in the information state tensor.

For the second message, indeed -- this is because exploitability is run in a tabular way using the information state string as keys for values. Exploitability is only something you can compute on small games (Kuhn, Leduc, etc.) so I would skip that part of the code in NFSP. This won't be possible in Spades.. it's too large. Probably your best bet -- if you want to measure any kind of exploitability -- would be to run DQN on the method and see if it can detect exploits (we have an example of how to do that here). It's the only way. Probably a better thing to do for now would be just to play against it and then try it against other strong agents. (You can compare also against the random player, some fixed hard-coded strategy, and/or IS-MCTS).

Hey @lanctot

Not sure when it got generated, but there was in fact a duplicate pyspiel.so file in the project's root folder that was being seen before the ones in build/python (both are in PYTHONPATH). Deleting the one in the root folder solved the problem.

With that working (and testing against Random agents for now instead of trying to measure exploitability), NFSP seems like it'll run now. It uses time_step.observations["info_state"] that it gets from the RL environment, which as you mentioned before will implicitly decide to use the observation tensor - so shouldn't need to add a flag!

After checking out the Kuhn and Leduc examples, I may still just be confused on how to represent the Information State, but it still seems like it would completely overlap with the Observation in the case of Spades. For example, looking at Kuhn:

// Betting sequence.
    if (iig_obs_type_.public_info) {
      if (iig_obs_type_.perfect_recall) {
        auto out = allocator->Get("betting", {2 * num_players - 1, 2});
        for (int i = num_players; i < state.history_.size(); ++i) {
          out.at(i - num_players, state.history_[i].action) = 1;
        }
      } else... 

Assuming the 3 player variant, the betting representation for the Information state would look something like:

[Pass, Bet] // p1 action
[Pass, Bet] // p2 action
[Pass, Bet] // p3 action
[Pass, Bet] // p1 action
[Pass, Bet] // p2 action

since p1 and p2 could both potentially Pass and then Call or Fold.

Following this format, Spades bidding would look like:

[0, 1, 2,...13] // p1 bid in range [0, 13]
[0, 1, 2,...13] // p2 bid in range [0, 13]
[0, 1, 2,...13] // p3 bid in range [0, 13]
[0, 1, 2,...13] // p4 bid in range [0, 13]

Each player in Spades makes 1 bid (there's no checking, passing, folding, raising, ect.) as their sole action during the bidding phase. This then is their contract for the round (the partnership needs to meet the combined contract, but '0' bids must be met by the individual).

Since the 'bid' action is the same as the 'contract' for that player, this is already represented in the Observation Tensor in this way. So is there additional information or a different way this data needs to be represented in the Information State Tensor? Or in the case of Spades, is it uniquely overlapping so that it's the same?

Thanks for all the continued help; very much appreciate it! After I get this part all settled and cleaned up, I should be able to make another PR for the updates to the Spades code.

Ah right, good. I didn't remember if NFSP used the RL environment or bypassed it. In that case you are right, no need for the flag because the RL environment takes care of that.

I think you should really take a look at the OpenSpiel paper where the difference between information states and observations are described. You can think of an information state as a sequence of (observation, action) from the start of the game -- literally every one, and in the correct order.

In Spades, this is not just the bid information, it's literally every card you play and every observation you've received (like who played what, and when). Whereas an observation is mostly just the incremental change, i.e. what's new from the last observation.

Ok so I just checked your observation tensor. It does seem to include the history. It does indeed look more like an information state than an observation because it's encoding a lot of the history. But maybe it's collapsing the order in which things happened? (Which would be ok -- since you're calling it an observation). The (perfect recall) information state would have to preserve the order in which everything happened.

But mainly, don't worry about all this unless you want to learn more about extensive-form games, because Spades is too big to include a full information state and you won't be able to compute exploitability exactly. So as long as you think you're including everything in the observation that is needed to learn efficiently, you should be fine.

Hope this helps!

Great, yes very helpful! I think I'm starting to get it; I'll definitely take another look at the paper to solidify my understanding for future reference. I'll go ahead with the Observation Tensor then and see if I can make some progress with training as well as clean up the code for a PR.

Thanks for your contributions. Closing now due to resolution. Please re-open if you need to discuss further!