
RL environment to study the evacuation of pedestrians from dimly lit rooms.

Learning curves for stable-baselines3 PPO agent

wandb report: smoothed learning curves

wandb report: comparing rewards study

Examples of trajectories

Comments on the trajectories, for a strict leader (enslaving degree = 1.0) and a calm leader (enslaving degree < 1.0):

  • exitrew & followrew
    • Strict leader: After saving a big group of pedestrians, the leader helped 2 groups of lost pedestrians find the way to the exit.
    • Calm leader: The leader tends to work with big groups of pedestrians and navigates them to the exit zone.
  • only exitrew
    • Strict leader: At the beginning of the episode the leader helps pedestrians near the exit, and at the end it finds the lost ones left far from the exit.
    • Calm leader: Here we can see how pedestrians navigate based on the directions of their neighbours. The leader tries to collect a big group and navigate it to the exit.
  • only exitrew
    • Strict leader: Because pedestrians need to be escorted to the exit zone, the leader tries to collect as many pedestrians as possible the first time it reaches the exit.
    • Calm leader: Sometimes pedestrians suddenly panic and move in a bad direction. The leader may try to bring them back or catch all the nearby ones.
  • only followers reward
    • Even when the leader is not rewarded for pedestrians reaching the exit zone, it tries to escort them to the exit as soon as possible.


git clone https://github.com/cinemere/evacuation
cd evacuation
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Quick start

Setup environment variables

Here are the default values, which can be changed via environment variables:

TBLOGS_DIR ?= "./saved_data/tb_logs"
WANDB_DIR ?= "./saved_data/"
CONFIG ?= "<path-to-yaml-config>"  # to set arguments from a config file
DEVICE ?= "cpu"
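
As a minimal sketch (not the project's actual loading code), the defaults above could be read in Python with `os.environ.get`. The helper name `read_settings` is hypothetical, and `CONFIG` is treated as unset by default because the list above only gives a placeholder for it.

```python
import os

# Defaults copied from the list above; CONFIG has no concrete default there.
DEFAULTS = {
    "TBLOGS_DIR": "./saved_data/tb_logs",
    "WANDB_DIR": "./saved_data/",
    "DEVICE": "cpu",
}


def read_settings(environ=None):
    """Return settings, letting environment variables override the defaults."""
    environ = os.environ if environ is None else environ
    settings = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    # CONFIG is optional: None means that no yaml config was given.
    settings["CONFIG"] = environ.get("CONFIG")
    return settings


if __name__ == "__main__":
    print(read_settings())
```

Passing a plain dict instead of `os.environ` makes it easy to check overrides, for example `read_settings({"DEVICE": "cuda"})`.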

Wandb cheat sheet

To enable wandb logging, create a wandb profile and run the following once:

wandb init
  • To disable wandb logging (for debugging or any other reason), run:
    wandb disabled
  • To enable wandb logging again, run:
    wandb enabled

Run experiments! 🏃

To run an experiment from the command line:

python src/main.py --env.experiment-name "my-first-experiment"

To use the evacuation env in your code:

from src.env import setup_env, EnvConfig, EnvWrappersConfig
from src.agents import RandomAgent

# Initialize environment
env = setup_env(EnvConfig, EnvWrappersConfig)

# Initialize random agent
random_agent = RandomAgent(env.action_space)

# Initialize episode
obs, _ = env.reset()
terminated, truncated = False, False

# Episode loop
while not (terminated or truncated):
    action = random_agent.act(obs)
    obs, reward, terminated, truncated, _ = env.step(action)

env.save_animation()      # save the episode trajectory as a gif
env.render()              # save the episode trajectory as a png
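
The same loop can be wrapped in a helper that returns the episode return and length. The sketch below assumes only the Gymnasium-style `reset`/`step` signatures used above; `CountdownEnv` and `run_episode` are hypothetical names, and the stand-in env exists only so the snippet runs without the repository installed.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in env: reward -1 per step, truncated after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.horizon
        # (obs, reward, terminated, truncated, info)
        return float(self.t), -1.0, False, truncated, {}


def run_episode(env, policy):
    """Run one episode with `policy(obs) -> action`; return (total reward, steps)."""
    obs, _ = env.reset()
    terminated = truncated = False
    total, steps = 0.0, 0
    while not (terminated or truncated):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        steps += 1
    return total, steps


if __name__ == "__main__":
    print(run_episode(CountdownEnv(), lambda obs: random.random()))
```

With the real environment, `policy` could be `random_agent.act` from the example above.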

To train an RPO agent with transformer embedding, use:

python3 src/main.py --env.experiment-name "my-experiment" \
                    --wrap.positions rel \
                    --wrap.statuses ohe \
                    --wrap.type Box \
                    model:clean-rl-config

To train an RPO agent with gravity encoding of observations, use:

python3 src/main.py --env.experiment-name "my-experiment" \
                    --wrap.positions grav \
                    model:clean-rl-config \
                    model.network:rpo-transformer-embedding-config


Input parameters

The most important parameters can be set through the command line. Some parameters, however, are defined in files; these are outlined here:

  • src/env/constants.py $\rightarrow$ switch distances:

    • SWITCH_DISTANCE_TO_LEADER $\rightarrow$ radius of catch by leader
    • SWITCH_DISTANCE_TO_OTHER_PEDESTRIAN $\rightarrow$ radius of interactions between pedestrians
    • SWITCH_DISTANCE_TO_EXIT $\rightarrow$ radius of the exit zone
    • SWITCH_DISTANCE_TO_ESCAPE $\rightarrow$ radius of the escape point
  • arguments passed to EvacuationEnv (src/utils.py)

usage: main.py [-h] [OPTIONS] [{model:sb-config,model:clean-rl-config,model:type}]

        To use yaml config set the env variable `CONFIG`:

        `CONFIG=<path-to-yaml-config> python main.py`

╭─ options ─────────────────────────────────────────────────────────────────────────────╮
│ -h, --help              show this help message and exit                               │
╭─ env options ─────────────────────────────────────────────────────────────────────────╮
│ env params                                                                            │
│ ───────────────────────────────────────────────────────────────────────────────────── │
│ --env.experiment-name STR                                                             │
│                         prefix of the experiment name for logging results (default:   │
│                         test)                                                         │
│ --env.number-of-pedestrians INT                                                       │
│                         number of pedestrians in the simulation (default: 10)         │
│ --env.width FLOAT       geometry of environment space: width (default: 1.0)           │
│ --env.height FLOAT      geometry of environment space: height (default: 1.0)          │
│ --env.step-size FLOAT   length of pedestrian's and agent's step Typical expected      │
│                         values: 0.1, 0.05, 0.01 (default: 0.01)                       │
│ --env.noise-coef FLOAT  noise coefficient of randomization in viscek model (default:  │
│                         0.2)                                                          │
│ --env.eps FLOAT         eps (default: 1e-08)                                          │
│ --env.enslaving-degree FLOAT                                                          │
│                         enslaving degree of leader in generalized viscek model vary   │
│                         in (0; 1], where 1 is full enslaving. Typical expected        │
│                         values: 0.1, 0.5, 1. (default: 1.0)                           │
│ --env.is-new-exiting-reward, --env.no-is-new-exiting-reward                           │
│                         if True, positive reward will be given for each pedestrian,   │
│                         entering the exiting zone (default: False)                    │
│ --env.is-new-followers-reward, --env.no-is-new-followers-reward                       │
│                         if True, positive reward will be given for each pedestrian,   │
│                         entering the leader's zone of influence (default: True)       │
│ --env.intrinsic-reward-coef FLOAT                                                     │
│                         coefficient in front of intrinsic reward (default: 0.0)       │
│ --env.is-termination-agent-wall-collision,                                            │
│ --env.no-is-termination-agent-wall-collision                                          │
│                         if True, agent's wall collision will terminate episode        │
│                         (default: False)                                              │
│ --env.init-reward-each-step FLOAT                                                     │
│                         constant reward given on each step of agent. Typical expected │
│                         values: 0, -1. (default: -1.0)                                │
│ --env.max-timesteps INT                                                               │
│                         max timesteps before truncation (default: 2000)               │
│ --env.n-episodes INT    number of episodes already done (for pretrained models)       │
│                         (default: 0)                                                  │
│ --env.n-timesteps INT   number of timesteps already done (for pretrained models)      │
│                         (default: 0)                                                  │
│ --env.render-mode {None}|STR                                                          │
│                         render mode (has no effect) (default: None)                   │
│ --env.draw, --env.no-draw                                                             │
│                         enable saving of animation at each step (default: False)      │
│ --env.verbose, --env.no-verbose                                                       │
│                         enable debug mode of logging (default: False)                 │
│ --env.giff-freq INT     frequency of logging the giff diagram (default: 500)          │
│ --env.wandb-enabled, --env.no-wandb-enabled                                           │
│                         enable wandb logging (if True wandb.init() should be called   │
│                         before initializing the environment) (default: True)          │
│ --env.path-giff STR     path to save giff animations: {path_giff}/{experiment_name}   │
│                         (default: saved_data/giff)                                    │
│ --env.path-png STR      path to save png images of episode trajectories:              │
│                         {path_png}/{experiment_name} (default: saved_data/png)        │
│ --env.path-logs STR     path to save logs: {path_logs}/{experiment_name} (default:    │
│                         saved_data/logs)                                              │
╭─ wrap options ────────────────────────────────────────────────────────────────────────╮
│ env wrappers params                                                                   │
│ ───────────────────────────────────────────────────────────────────────────────────── │
│ --wrap.num-obs-stacks INT                                                             │
│                         number of times to stack observation (default: 1)             │
│ --wrap.positions {abs,rel,grav}                                                       │
│                         positions:                                                    │
│                         - 'abs': absolute coordinates                                 │
│                         - 'rel': relative coordinates                                 │
│                         - 'grav': gradient gravity potential encoding                 │
│                         (GravityEncoding) (default: abs)                              │
│ --wrap.statuses {no,ohe,cat}                                                          │
│                         add pedestrians statuses to obeservation as one-hot-encoded   │
│                         columns. NOTE: this value has no effect when                  │
│                         `positions`='grad' is selected. (default: no)                 │
│ --wrap.type {Dict,Box}  concatenate Dict-type observation to a Box-type observation   │
│                         (with added statuses to the observation) (default: Dict)      │
│ --wrap.alpha FLOAT      alpha parameter of GravityEncoding. The value of alpha        │
│                         determines the strength and shape of the potential function.  │
│                         Higher value results in a stronger repulsion between the      │
│                         agent and the pedestrians, a lower value results in a weaker  │
│                         repulsion. Typical expected values vary from 1 to 5.          │
│                         (default: 3)                                                  │
╭─ optional subcommands ────────────────────────────────────────────────────────────────╮
│ select the config of model  (default: model:type)                                     │
│ ───────────────────────────────────────────────────────────────────────────────────── │
│ [{model:sb-config,model:clean-rl-config,model:type}]                                  │
│     model:sb-config     Stable Baselines Model Config                                 │
│     model:clean-rl-config                                                             │
│                         Clean RL Model Config                                         │
│     model:type                                                                        │
╭─ model.agent options ─────────────────────────────────────────────────────────────────╮
│ select the parametrs of trainig the agent                                             │
│ ───────────────────────────────────────────────────────────────────────────────────── │
│ --model.agent.exp-name STR                                                            │
│     the name of this experiment (default: rpo-agent)                                  │
│ --model.agent.seed INT                                                                │
│     seed of the experiment (default: 1)                                               │
│ --model.agent.torch-deterministic, --model.agent.no-torch-deterministic               │
│     if toggled, `torch.backends.cudnn.deterministic=False` (default: True)            │
│ --model.agent.cuda, --model.agent.no-cuda                                             │
│     if toggled, cuda will be enabled by default (default: True)                       │
│ --model.agent.total-timesteps INT                                                     │
│     total timesteps of the experiments (default: 80000000)                            │
│ --model.agent.learning-rate FLOAT                                                     │
│     the learning rate of the optimizer (default: 0.0003)                              │
│ --model.agent.num-envs INT                                                            │
│     the number of parallel game environments (default: 3)                             │
│ --model.agent.num-steps INT                                                           │
│     the number of steps to run in each environment per policy rollout (default: 2048) │
│ --model.agent.anneal-lr, --model.agent.no-anneal-lr                                   │
│     Toggle learning rate annealing for policy and value networks (default: True)      │
│ --model.agent.gamma FLOAT                                                             │
│     the discount factor gamma (default: 0.99)                                         │
│ --model.agent.gae-lambda FLOAT                                                        │
│     the lambda for the general advantage estimation (default: 0.95)                   │
│ --model.agent.num-minibatches INT                                                     │
│     the number of mini-batches (default: 32)                                          │
│ --model.agent.update-epochs INT                                                       │
│     the K epochs to update the policy (default: 10)                                   │
│ --model.agent.norm-adv, --model.agent.no-norm-adv                                     │
│     Toggles advantages normalization (default: True)                                  │
│ --model.agent.clip-coef FLOAT                                                         │
│     the surrogate clipping coefficient (default: 0.2)                                 │
│ --model.agent.clip-vloss, --model.agent.no-clip-vloss                                 │
│     Toggles whether or not to use a clipped loss for the value function, as per the   │
│     paper. (default: True)                                                            │
│ --model.agent.ent-coef FLOAT                                                          │
│     coefficient of the entropy (default: 0.0)                                         │
│ --model.agent.vf-coef FLOAT                                                           │
│     coefficient of the value function (default: 0.5)                                  │
│ --model.agent.max-grad-norm FLOAT                                                     │
│     the maximum norm for the gradient clipping (default: 0.5)                         │
│ --model.agent.target-kl {None}|FLOAT                                                  │
│     the target KL divergence threshold (default: None)                                │
╭─ subcommands ─────────────────────────────────────────────────────────────────────────╮
│ select the network params                                                             │
│ ───────────────────────────────────────────────────────────────────────────────────── │
│ {model.network:rpo-linear-network-config,model.network:rpo-transformer-embedding-con… │
│     model.network:rpo-linear-network-config                                           │
│     model.network:rpo-transformer-embedding-config                                    │
│     RPO agent network with transforment encoding                                      │
│     model.network:rpo-deep-sets-embedding-config                                      │
│     RPO agent network with deep sets encoding                                         │
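
To give an intuition for how `SWITCH_DISTANCE_TO_LEADER`, `--env.enslaving-degree` and `--env.noise-coef` might interact, here is a toy direction update for a single pedestrian. It is an illustrative sketch under stated assumptions, not the update rule implemented in the repository: the constant's value, the linear blending rule, the uniform noise, and the names `normalize` and `new_direction` are all hypothetical, and `own_dir` stands in for the neighbour-averaged direction of a Vicsek-type model.

```python
import math
import random

SWITCH_DISTANCE_TO_LEADER = 0.2  # hypothetical value; the real one is in src/env/constants.py


def normalize(v):
    """Scale a 2D vector to unit length (the zero vector stays zero)."""
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm) if norm > 0 else (0.0, 0.0)


def new_direction(pedestrian, own_dir, leader, leader_dir,
                  enslaving_degree=1.0, noise_coef=0.2, rng=random):
    """Toy update of one pedestrian's unit direction (assumed rule, for illustration)."""
    if math.dist(pedestrian, leader) < SWITCH_DISTANCE_TO_LEADER:
        # Inside the catch radius: blend the leader's direction with the own one.
        d = enslaving_degree
        blended = (d * leader_dir[0] + (1 - d) * own_dir[0],
                   d * leader_dir[1] + (1 - d) * own_dir[1])
    else:
        # Outside the catch radius the leader has no influence.
        blended = own_dir
    # Assumed noise model: independent uniform perturbation of each component.
    noisy = (blended[0] + noise_coef * rng.uniform(-1, 1),
             blended[1] + noise_coef * rng.uniform(-1, 1))
    return normalize(noisy)


if __name__ == "__main__":
    # Pedestrian next to the leader, full enslaving, no noise.
    print(new_direction((0, 0), (0, 1), (0.1, 0), (1, 0), 1.0, 0.0))
```

In this sketch an enslaving degree of 1.0 makes a caught pedestrian copy the leader's direction, while smaller values leave part of its own heading in place.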


Outputs are saved in the following directories and files:

  • saved_data/giff/ $\rightarrow$ episode trajectory as a gif

  • saved_data/png/ $\rightarrow$ episode trajectory in png

  • saved_data/models/ $\rightarrow$ trained models

  • saved_data/logs/ $\rightarrow$ ${exp_name}.txt log of episode trajectories

  • saved_data/tb_logs/ $\rightarrow$ tensorboard logs

  • saved_data/config/ $\rightarrow$ ${exp_name}.yaml config of current experiment

  • wandb/ $\rightarrow$ wandb logs

    Example of logging of conducted experiment
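
If you want to prepare this layout up front (for example on a fresh machine), a small helper like the one below can do it. This is only a sketch: `ensure_output_dirs` is a hypothetical name, the directory names are taken from the list above, and the project may well create these directories itself.

```python
import tempfile
from pathlib import Path

# Directory names taken from the list above.
OUTPUT_DIRS = ("giff", "png", "models", "logs", "tb_logs", "config")


def ensure_output_dirs(root="saved_data"):
    """Create the output directories under `root` and return them by name."""
    paths = {name: Path(root) / name for name in OUTPUT_DIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in ensure_output_dirs(Path(tmp) / "saved_data").items():
            print(name, path.is_dir())
```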
