Poker_CFR

Counterfactual Regret Minimization for poker games.

CFR Algorithm

In essence, CFR is a regret matching procedure applied to minimize an entity called the immediate counterfactual regret.
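At a single decision point, regret matching simply plays each action with probability proportional to its accumulated positive regret. Below is a minimal sketch in Python; the helper name and the example regrets are illustrative, not taken from the notebooks.

```python
import numpy as np

def regret_matching(cumulative_regrets):
    """Turn cumulative per-action regrets into a strategy.

    Action probabilities are proportional to positive cumulative regret;
    if no action has positive regret, fall back to the uniform strategy.
    """
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cumulative_regrets), 1.0 / len(cumulative_regrets))

# Example: three actions with accumulated regrets 3, -1 and 1
print(regret_matching(np.array([3.0, -1.0, 1.0])))  # [0.75 0.   0.25]
```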

Average overall regret

Let $\pi^{\sigma}(h)$ be the probability of history $h$ occurring if players choose actions according to $\sigma$. The overall value to player $i$ of a strategy profile is then the expected payoff of the resulting terminal node, where $Z$ is the set of terminal histories:

$$u_i(\sigma) = \sum_{h \in Z} u_i(h)\, \pi^{\sigma}(h)$$
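As a quick illustration, here is how that expected payoff would be computed for a hypothetical handful of terminal histories (the numbers are made up):

```python
# Hypothetical terminal histories: (probability of reaching the terminal under sigma, payoff to player i)
terminals = [(0.5, 1.0), (0.25, -2.0), (0.25, 4.0)]

# u_i(sigma): probability-weighted sum of terminal payoffs
u_i = sum(prob * payoff for prob, payoff in terminals)
print(u_i)  # 0.5*1.0 + 0.25*(-2.0) + 0.25*4.0 = 1.0
```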

Let $\sigma_i^t$ be the strategy used by player $i$ on round $t$. The average overall regret of player $i$ at time $T$ is:

$$R_i^T = \frac{1}{T}\, \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \right)$$

Moreover, define $\bar{\sigma}_i^t$ to be the average strategy for player $i$ from time 1 to $t$.
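In CFR implementations this average is usually built up incrementally: on every iteration the current strategy at an information set is added to a running sum, weighted by the player's own probability of reaching that information set, and the sum is normalized at the end. A small sketch under that assumption (function and variable names are illustrative):

```python
import numpy as np

def update_strategy_sum(strategy_sum, current_strategy, my_reach_prob):
    """Add the current strategy at an information set to the running sum,
    weighted by the player's own probability of reaching that set."""
    return strategy_sum + my_reach_prob * np.asarray(current_strategy)

def average_strategy(strategy_sum):
    """Normalize the accumulated, reach-weighted strategy sum."""
    total = strategy_sum.sum()
    if total > 0:
        return strategy_sum / total
    return np.full(len(strategy_sum), 1.0 / len(strategy_sum))

# Example: two iterations at one information set
s = np.zeros(2)
s = update_strategy_sum(s, [0.7, 0.3], my_reach_prob=1.0)
s = update_strategy_sum(s, [0.2, 0.8], my_reach_prob=0.5)
print(average_strategy(s))  # [(0.7 + 0.1) / 1.5, (0.3 + 0.4) / 1.5]
```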

Counterfactual utility

For every opponent’s hand (game state $h$), we use the probability of reaching $h$ under the assumption that we wanted to get to $h$. So instead of using our regular strategy from the strategy profile $\sigma$, we modify it slightly so that it always tries to reach our current game state: for each information set prior to the currently assumed game state we pretend we always played the pure behavioral strategy that puts the whole probability mass on the action that was actually played and led to the currently assumed state. This is counterfactual – in opposition to the facts – because we really played according to $\sigma$. In practice, this means we only consider our opponents’ contribution $\pi_{-i}^{\sigma}(h)$ to the probability of reaching the currently assumed game state $h$.

Formally, counterfactual utility for information set $I$, player $i$ and strategy $\sigma$ is given by:

$$u_i(\sigma, I) = \frac{\sum_{h \in I,\, h' \in Z} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h, h')\, u_i(h')}{\sum_{h \in I} \pi_{-i}^{\sigma}(h)}$$

The denominator is the sum of our counterfactual weights $\pi_{-i}^{\sigma}(h)$ over the histories $h \in I$ – a normalizing constant.

You may also find an unnormalized form of the above – that is fine, and let’s have it too, as it will come in handy later:

$$\tilde{u}_i(\sigma, I) = \sum_{h \in I,\, h' \in Z} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h, h')\, u_i(h')$$
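A small sketch of both quantities, assuming we already know, for every history $h$ in the information set, the opponents’ reach probability $\pi_{-i}^{\sigma}(h)$ and, for every terminal $h'$ reachable from it, the pair $(\pi^{\sigma}(h, h'), u_i(h'))$; the data layout here is purely illustrative:

```python
def counterfactual_utilities(histories):
    """Normalized and unnormalized counterfactual utility for one information set.

    `histories` maps every history h in the information set to a pair
    (opponent_reach, terminals), where `opponent_reach` is pi_{-i}^sigma(h)
    and `terminals` is a list of (play_prob, payoff) pairs: the probability
    pi^sigma(h, h') of reaching terminal h' from h, and the payoff u_i(h').
    """
    unnormalized = 0.0
    normalizer = 0.0
    for opponent_reach, terminals in histories.values():
        normalizer += opponent_reach
        for play_prob, payoff in terminals:
            unnormalized += opponent_reach * play_prob * payoff
    normalized = unnormalized / normalizer if normalizer > 0 else 0.0
    return normalized, unnormalized

# Example: two possible opponent hands (histories) in the information set
histories = {
    "h1": (0.5, [(1.0, 2.0)]),    # opponent reach 0.5, one terminal worth 2
    "h2": (0.25, [(1.0, -4.0)]),  # opponent reach 0.25, one terminal worth -4
}
print(counterfactual_utilities(histories))  # (0.0, 0.0)
```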

Immediate Counterfactual Regret

To introduce counterfactual regret minimization we need to look at the poker game from a specific angle. First of all, we will be looking at a single information set, a single decision point. We will consider acting in this information set repeatedly over time, with the goal of acting in the best possible way with respect to a certain reward measure.

Assuming we are playing as player $i$, let’s agree that the reward for playing action $a$ in information set $I$ is the unnormalized counterfactual utility under the assumption that we played action $a$ (let’s just assume this is how the environment rewards us). The entity in consideration is then defined as:

$$\tilde{u}_i(\sigma|_{I \to a}, I) = \sum_{h \in I,\, h' \in Z} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h a, h')\, u_i(h')$$

where $h a$ is the game state implied by playing action $a$ in game state $h$. We can do this because we assume we played $a$ with probability 1.

We can define the regret in our setting to be:

$$R_{i,\text{imm}}^T(I) = \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^{T} \left( \tilde{u}_i(\sigma^t|_{I \to a}, I) - \tilde{u}_i(\sigma^t, I) \right)$$

which is called the Immediate Counterfactual Regret.

Similarly, the Immediate Counterfactual Regret of not playing action $a$ is given by:

$$R_{i,\text{imm}}^T(I, a) = \frac{1}{T} \sum_{t=1}^{T} \left( \tilde{u}_i(\sigma^t|_{I \to a}, I) - \tilde{u}_i(\sigma^t, I) \right)$$
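Putting the last two definitions together, one iteration at an information set only needs the unnormalized counterfactual utility of each action and of the current strategy; accumulating their differences over iterations and averaging gives the per-action regrets, and the maximum over actions gives the immediate counterfactual regret. A minimal sketch under those assumptions (names and numbers are illustrative):

```python
import numpy as np

def regret_increments(action_utils, node_util):
    """Per-action regret increment for one iteration at one information set.

    `action_utils[a]` is the unnormalized counterfactual utility of always
    playing action a at I, `node_util` the unnormalized counterfactual
    utility of the current strategy.
    """
    return np.asarray(action_utils) - node_util

# Example: accumulate over T = 2 iterations, then average
cumulative = np.zeros(3)
for action_utils, node_util in [([1.0, 0.0, 0.5], 0.5), ([0.25, 1.0, 0.0], 0.25)]:
    cumulative += regret_increments(action_utils, node_util)

T = 2
per_action = cumulative / T   # R^T(I, a): regret of not having played a
immediate = per_action.max()  # R^T_imm(I): max over actions
print(per_action, immediate)
```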

For more information about these concepts, you can refer to the original paper and this post.