Big Two with Deep Reinforcement Learning

Big2-RL is a reinforcement learning framework for Big Two (Cantonese: 鋤大弟), a four-player card-shedding game popular in many Southeast Asian countries, played with a standard 52-card deck without jokers. Our framework uses multiprocessing and is heavily inspired by DouZero and TorchBeast (see Acknowledgements).

Each player's goal is to empty their hand of all cards before the other players. Cards are shed (or played) through tricks consisting of specific hand types (singles, pairs, triples and five-card hands), and each player must either follow the trick by playing a higher-ranked hand of the same type as the player who led it, or pass. If all other players pass, the player who won the trick "leads" the next trick and chooses which hand type to play. When one player empties their hand (the winner), the remaining players are penalised based on the number of cards left in their hands, and these penalties are awarded to the winner.
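
As a rough illustration of the settlement step, the sketch below scores one hypothetical finished deal. It is only a minimal sketch of a common penalty scheme, not necessarily the exact rules used in this repo; the thresholds and multipliers are assumptions, and the actual penalty options are configurable in settings.py (see Settings below).

# Minimal sketch of a common Big Two settlement scheme (an assumption, not
# necessarily the exact rules implemented in this repo; see settings.py).
def settle(cards_left):
    """cards_left: dict mapping each of the 4 players to cards remaining.
    Returns the reward for each player; penalties are paid to the winner."""
    def penalty(n):
        # Hypothetical multipliers: many variants double/triple the penalty
        # for hands with 10+ or 13 cards left.
        if n == 0:
            return 0
        if n >= 13:
            return 3 * n
        if n >= 10:
            return 2 * n
        return n

    rewards = {p: -penalty(n) for p, n in cards_left.items()}
    winner = min(cards_left, key=cards_left.get)  # the player with 0 cards
    rewards[winner] = -sum(r for p, r in rewards.items() if p != winner)
    return rewards

# Example: South wins, the others hold 3, 10 and 13 cards.
print(settle({"south": 0, "west": 3, "north": 10, "east": 13}))
# -> {'south': 62, 'west': -3, 'north': -20, 'east': -39}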

In contrast to Dou Dizhu, which has clearly defined roles for each player, collaboration in Big Two is much more fluid. For instance, it is common for a player to pass when the opponent immediately before them has just played, in the hope of preserving higher-ranked cards for later and of acting second if that opponent wins and leads the next trick. Conversely, players tend to play when the opponent immediately after them has just played, because if that opponent leads the next trick they will be last to act in the following round. Additionally, Dou Dizhu imposes no extra penalty for holding many unplayed cards, whereas Big Two is inherently riskier: keeping more cards usually gives more manoeuvrability, but it also incurs a larger penalty if the player loses.

In this work, we explore a variety of model structures and evaluate their respective performances. Please read our paper for more details.

Paper for this work

If you find this project helpful in your research, please cite our work.

@software{big2drl,
  author = {J. Chow and J. Cheng},
  doi = {10.5281/zenodo.7811506},
  title = {{Deep Reinforcement Learning for Big Two}},
  url = {https://github.com/johnnyhoichuen/big2-rl/},
  version = {1.0.0},
  year = {2022}
}

Installation

The training code is designed for GPUs, so you will need to install CUDA first if you want to train models; you may refer to NVIDIA's CUDA installation guide. For evaluation, CUDA is optional and a CPU is sufficient.
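
Before training, you can confirm that PyTorch can actually see a CUDA device with a quick check (this assumes the dependencies below, including PyTorch, are already installed):

import torch

# True means a CUDA-capable GPU is visible to PyTorch; otherwise fall back to
# CPU training (--actor_device_cpu / --training_device cpu, see Training below).
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))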

First, clone the repo with:

git clone https://github.com/johnnyhoichuen/big2-rl.git

Make sure you have Python 3.6+ installed, then install the dependencies:

cd big2-rl
pip3 install -r requirements.txt

Training

Using SLURM

We used HKUST's HPC3 (High Performance Computing Cluster) for training. To schedule jobs on the cluster:

cd slurm_report
sbatch slurm_cpu_vs_ppo.sh
sbatch slurm_cpu_vs_prior.sh
sbatch slurm_cpu_vs_rand.sh

To use GPU for training, run

python3 train.py

This will train on a single GPU. To train on multiple GPUs, use the following arguments:

  • --gpu_devices: which GPU devices are visible
  • --num_actor_devices: how many of the GPU devices will be used for simulation, i.e., self-play
  • --num_actors: how many actor processes will be used for each device
  • --training_device: which device will be used for training (learner process)

For example, if we have 4 GPUs and want to use the first 3 for simulation with 15 actors each and the 4th for training, we can run the following command:

python3 train.py --gpu_devices 0,1,2,3 --num_actor_devices 3 --num_actors 15 --training_device 3

To use CPU training or simulation (Windows can only use CPU for actors), use the following arguments:

  • --training_device cpu: Use CPU to train the model
  • --actor_device_cpu: Use CPU as actors

For example, use the following command to run everything on CPU:

python3 train.py --actor_device_cpu --training_device cpu

The following command only runs actors on CPU:

python3 train.py --actor_device_cpu

For more customized configuration of training, see the following optional arguments:

--xpid XPID           Experiment id (default: big2rl)
--save_interval SAVE_INTERVAL
                    Time interval (in minutes) at which to save the model
--opponent_agent OPPONENT_AGENT
                    Type of opponent agent to be placed in the other 3 positions, which the model will be tested against. Values = {prior, ppo, random}
--actor_device_cpu    Use CPU as actor device
--gpu_devices GPU_DEVICES
                    Which GPUs to use for training
--num_actor_devices NUM_ACTOR_DEVICES
                    The number of devices used for simulation
--num_actors NUM_ACTORS
                    The number of actors for each simulation device
--training_device TRAINING_DEVICE
                    The index of the GPU used for training models; `cpu` means use the CPU
--load_model          Load an existing model
--disable_checkpoint  Disable saving checkpoint
--savedir SAVEDIR     Root dir where experiment data will be saved
--total_frames TOTAL_FRAMES
                    Total environment frames to train for
--exp_epsilon EXP_EPSILON
                    The probability for exploration
--batch_size BATCH_SIZE
                    Learner batch size
--unroll_length UNROLL_LENGTH
                    The unroll length (time dimension)
--num_buffers NUM_BUFFERS
                    Number of shared-memory buffers for a given actor device
--num_threads NUM_THREADS
                    Number of learner threads
--max_grad_norm MAX_GRAD_NORM
                    Max norm of gradients
--learning_rate LEARNING_RATE
                    Learning rate
--alpha ALPHA         RMSProp smoothing constant
--momentum MOMENTUM   RMSProp momentum
--epsilon EPSILON     RMSProp epsilon
--model_type MODEL_TYPE
                    Model architecture to use for DMC

Evaluation

Evaluation can be performed with a GPU or CPU (GPU is much faster). Pretrained models are available in baselines/. Performance is evaluated through self-play against the following types of opponent:

  • ppo: agents based on Charlesworth's PPO model
  • prior: previously trained DMC agents (trained for some number of iterations)
  • random: agents that play randomly (select a move from the set of legal moves with uniform probability)
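
As a rough sketch, the random opponent's policy amounts to the following (the class name and infoset interface here are hypothetical; the repo's actual agent classes may differ):

import random

class RandomAgent:
    """Picks uniformly at random among the legal moves offered by the
    environment. Hypothetical interface, for illustration only."""
    def act(self, infoset):
        # infoset.legal_actions is assumed to be a non-empty list of legal moves
        return random.choice(infoset.legal_actions)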

Step 1: Generate evaluation data

python3 generate_eval_data.py

Some important arguments are as follows.

  • --output: where the pickled data will be saved
  • --num_games: how many random games will be generated, default 10000
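
The output is an ordinary pickle file, so it can be inspected directly. In the snippet below the filename is just an example value for --output, and the assumption that the file holds a sequence of pre-dealt games may not match the script's exact format:

import pickle

# 'eval_data.pkl' is a hypothetical --output value; use whatever path you passed.
with open('eval_data.pkl', 'rb') as f:
    eval_data = pickle.load(f)

# Assuming the file holds a sequence of pre-dealt games:
print(type(eval_data), len(eval_data))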

Step 2: Evaluation & Self-Play

python3 evaluate.py

Some important arguments are as follows.

  • --model_type: the model architecture to use (standard, convres, convolutional, residual)
  • --train_opponent: which agent the DMC agent was trained against (ppo, random, or a prior agent). Due to symmetry, during training the trainable agent is placed in South and all other positions are occupied by copies of the same train_opponent agent.
  • --eval_opponent: which agent the DMC agent is evaluated against (ppo, random, or a prior agent). Due to symmetry, the trained DMC agent is placed in South and all other positions are occupied by copies of the same eval_opponent agent.
  • --frames_trained: the number of frames the DMC agent was trained for.
  • --eval_data: the pickle file that contains the evaluation data generated by generate_eval_data.py
  • --num_workers: how many subprocesses will be used to run evaluation data
  • --gpu_device: which GPU to use. It will use CPU by default

For example, the following command evaluates the performance of a standard DMC agent that was trained against PPO agents for 2,000,000 frames, with random agents in all other positions during evaluation.

python evaluate.py --model_type 'standard' --train_opponent 'ppo' --eval_opponent 'random' --frames_trained 2000000

By default, our model will be saved in big2rl_checkpoints/big2rl every 10 minutes.
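
Saved checkpoints are ordinary PyTorch files, so they can be inspected with torch.load. The path and the dictionary layout below are assumptions based on the default savedir/xpid described above, not a guaranteed format:

import torch

# Hypothetical checkpoint path under the default savedir/xpid.
ckpt = torch.load('big2rl_checkpoints/big2rl/model.tar', map_location='cpu')

# If the checkpoint is a dict (e.g. of state dicts), its keys show what was saved.
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
else:
    print(type(ckpt))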

Play

You can also play against our pre-trained models.

python3 play-big2.py

Some important arguments are as follows.

  • --east: path of the model for the East player to use. Can be 'ppo' or 'random' or the location of a .tar or .ckpt file for a 'prior' model
  • --north: path of the model for the North player to use. Can be 'ppo' or 'random' or the location of a .tar or .ckpt file for a 'prior' model
  • --west: path of the model for the West player to use. Can be 'ppo' or 'random' or the location of a .tar or .ckpt file for a 'prior' model

Settings

You can also modify settings.py before training, evaluation or play (e.g. to change the ranking of straights and flushes, or the penalty scheme). For details, please refer to settings.py.
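
For illustration only, the kinds of options meant here look roughly like the constants below; the variable names and values are hypothetical, so check settings.py itself for the real ones:

# Hypothetical names, for illustration only -- see settings.py for the actual options.
STRAIGHT_ORDER = "A2345_low"          # how ace-low and other straights are ranked
FLUSH_BEATS_STRAIGHT = True           # relative order of flushes and straights
PENALTY_THRESHOLDS = {10: 2, 13: 3}   # cards remaining -> penalty multiplier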

Core Team

  • J. Chow
  • J. Cheng

Acknowledgements

  • Zha, Daochen, et al. “DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning.” ICML (2021).
  • Charlesworth, Henry. “Application of Self-Play Reinforcement Learning to a Four-Player Game of Imperfect Information.” (2018).