RL environment to study the evacuation of pedestrians from dimly lit rooms.
Learning curves for stable-baselines3 PPO agent
wandb report: smoothed learning curves
wandb report: comparing rewards study
```shell
git clone https://github.com/cinemere/evacuation
cd evacuation
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Here are the default values, which can be changed via environment variables:
```shell
TBLOGS_DIR ?= "./saved_data/tb_logs"
WANDB_DIR ?= "./saved_data/"
CONFIG ?= "<path-to-yaml-config>"  # to set up arguments from a config
DEVICE ?= "cpu"
```
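These `?=` defaults behave like ordinary environment-variable lookups with fallbacks; a minimal sketch of the equivalent lookup in Python (variable names and defaults taken from the list above):

```python
import os

# Each setting falls back to the documented default when the
# environment variable is unset (mirrors the `?=` semantics above).
TBLOGS_DIR = os.environ.get("TBLOGS_DIR", "./saved_data/tb_logs")
WANDB_DIR = os.environ.get("WANDB_DIR", "./saved_data/")
DEVICE = os.environ.get("DEVICE", "cpu")
```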
To enable wandb logging, create a wandb profile and run the following once:

```shell
wandb init
```

- To disable wandb logging (for debugging or other reasons), run:

```shell
wandb disabled
```

- To re-enable wandb logging, run:

```shell
wandb enabled
```
To run experiment from command line:
```shell
python src/main.py --env.experiment-name "my-first-experiment"
```
To use evacuation env in your code:
```python
from src.env import setup_env, EnvConfig, EnvWrappersConfig
from src.agents import RandomAgent

# Initialize environment
env = setup_env(EnvConfig, EnvWrappersConfig)

# Initialize random agent
random_agent = RandomAgent(env.action_space)

# Initialize episode
obs, _ = env.reset()
terminated, truncated = False, False

# Episode loop
while not (terminated or truncated):
    action = random_agent.act(obs)
    obs, reward, terminated, truncated, _ = env.step(action)

env.save_animation()  # save episode trajectory as a GIF
env.render()          # save episode trajectory as a PNG
```
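The episode loop above works against any environment that follows the Gymnasium five-tuple `step` API, so it can be factored into a reusable helper (a sketch, not code from the repo):

```python
def run_episode(env, agent):
    """Roll out one episode and return the total reward.

    Works with any environment that follows the Gymnasium
    (obs, reward, terminated, truncated, info) step API,
    like the evacuation env above.
    """
    obs, _ = env.reset()
    terminated = truncated = False
    total_reward = 0.0
    while not (terminated or truncated):
        obs, reward, terminated, truncated, _ = env.step(agent.act(obs))
        total_reward += reward
    return total_reward
```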
To train an RPO agent with a transformer embedding, use:

```shell
python3 src/main.py --env.experiment-name "my-experiment" \
    --wrap.positions rel \
    --wrap.statuses ohe \
    --wrap.type Box \
    model:clean-rl-config \
    model.network:rpo-transformer-embedding-config
```
To train an RPO agent with gravity encoding of observations, use:

```shell
python3 src/main.py --env.experiment-name "my-experiment" \
    --wrap.positions grav \
    model:clean-rl-config \
    model.network:rpo-linear-network-config
```
The most important parameters can be set through the command line. However, some parameters live in files; they are outlined below:
- `src/env/constants.py` $\rightarrow$ switch distances:
  - `SWITCH_DISTANCE_TO_LEADER` $\rightarrow$ radius of catch by the leader
  - `SWITCH_DISTANCE_TO_OTHER_PEDESTRIAN` $\rightarrow$ radius of interactions between pedestrians
  - `SWITCH_DISTANCE_TO_EXIT` $\rightarrow$ radius of the exit zone
  - `SWITCH_DISTANCE_TO_ESCAPE` $\rightarrow$ radius of the escape point
- arguments passed to `EvacuationEnv` (`src/utils.py`)
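The switch distances are thresholds on pairwise distances; conceptually, a status switch is a check like the following (the helper and the numeric value here are hypothetical; the real constants live in `src/env/constants.py`):

```python
import math

# Hypothetical threshold value; see src/env/constants.py for the real one
SWITCH_DISTANCE_TO_LEADER = 0.2

def within_switch_distance(pedestrian_xy, leader_xy,
                           threshold=SWITCH_DISTANCE_TO_LEADER):
    """True if the pedestrian is inside the leader's zone of influence."""
    return math.dist(pedestrian_xy, leader_xy) <= threshold
```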
usage: main.py [-h] [OPTIONS] [{model:sb-config,model:clean-rl-config,model:type}]
To use yaml config set the env variable `CONFIG`:
`CONFIG=<path-to-yaml-config> python main.py`
```
options:
  -h, --help            show this help message and exit
```
```
env options (env params):
  --env.experiment-name STR
        prefix of the experiment name for logging results (default: test)
  --env.number-of-pedestrians INT
        number of pedestrians in the simulation (default: 10)
  --env.width FLOAT
        geometry of environment space: width (default: 1.0)
  --env.height FLOAT
        geometry of environment space: height (default: 1.0)
  --env.step-size FLOAT
        length of pedestrian's and agent's step. Typical expected values:
        0.1, 0.05, 0.01 (default: 0.01)
  --env.noise-coef FLOAT
        noise coefficient of randomization in the Vicsek model (default: 0.2)
  --env.eps FLOAT
        eps (default: 1e-08)
  --env.enslaving-degree FLOAT
        enslaving degree of the leader in the generalized Vicsek model; varies
        in (0; 1], where 1 is full enslaving. Typical expected values:
        0.1, 0.5, 1. (default: 1.0)
  --env.is-new-exiting-reward, --env.no-is-new-exiting-reward
        if True, positive reward will be given for each pedestrian entering
        the exiting zone (default: False)
  --env.is-new-followers-reward, --env.no-is-new-followers-reward
        if True, positive reward will be given for each pedestrian entering
        the leader's zone of influence (default: True)
  --env.intrinsic-reward-coef FLOAT
        coefficient in front of the intrinsic reward (default: 0.0)
  --env.is-termination-agent-wall-collision, --env.no-is-termination-agent-wall-collision
        if True, the agent's wall collision will terminate the episode (default: False)
  --env.init-reward-each-step FLOAT
        constant reward given on each step of the agent. Typical expected
        values: 0, -1. (default: -1.0)
  --env.max-timesteps INT
        max timesteps before truncation (default: 2000)
  --env.n-episodes INT
        number of episodes already done (for pretrained models) (default: 0)
  --env.n-timesteps INT
        number of timesteps already done (for pretrained models) (default: 0)
  --env.render-mode {None}|STR
        render mode (has no effect) (default: None)
  --env.draw, --env.no-draw
        enable saving of animation at each step (default: False)
  --env.verbose, --env.no-verbose
        enable debug mode of logging (default: False)
  --env.giff-freq INT
        frequency of logging the GIF animation (default: 500)
  --env.wandb-enabled, --env.no-wandb-enabled
        enable wandb logging (if True, wandb.init() should be called before
        initializing the environment) (default: True)
  --env.path-giff STR
        path to save GIF animations: {path_giff}/{experiment_name}
        (default: saved_data/giff)
  --env.path-png STR
        path to save PNG images of episode trajectories:
        {path_png}/{experiment_name} (default: saved_data/png)
  --env.path-logs STR
        path to save logs: {path_logs}/{experiment_name}
        (default: saved_data/logs)
```
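The reward-related flags above (`--env.init-reward-each-step`, `--env.is-new-exiting-reward`, `--env.is-new-followers-reward`, `--env.intrinsic-reward-coef`) plausibly combine per step along these lines (a hedged sketch; the actual formula lives in the environment code):

```python
def step_reward(init_reward_each_step=-1.0,
                n_new_exiting=0, is_new_exiting_reward=False,
                n_new_followers=0, is_new_followers_reward=True,
                intrinsic_reward=0.0, intrinsic_reward_coef=0.0):
    """Hypothetical composition of the per-step reward from the flags above."""
    reward = init_reward_each_step                 # constant per-step term
    if is_new_exiting_reward:
        reward += n_new_exiting                    # bonus per pedestrian entering the exit zone
    if is_new_followers_reward:
        reward += n_new_followers                  # bonus per pedestrian joining the leader
    reward += intrinsic_reward_coef * intrinsic_reward
    return reward
```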
```
wrap options (env wrappers params):
  --wrap.num-obs-stacks INT
        number of times to stack the observation (default: 1)
  --wrap.positions {abs,rel,grav}
        positions:
          - 'abs': absolute coordinates
          - 'rel': relative coordinates
          - 'grav': gradient gravity potential encoding (GravityEncoding)
        (default: abs)
  --wrap.statuses {no,ohe,cat}
        add pedestrians' statuses to the observation as one-hot-encoded
        columns. NOTE: this value has no effect when `positions`='grav'
        is selected. (default: no)
  --wrap.type {Dict,Box}
        concatenate the Dict-type observation into a Box-type observation
        (with statuses added to the observation) (default: Dict)
  --wrap.alpha FLOAT
        alpha parameter of GravityEncoding. The value of alpha determines the
        strength and shape of the potential function. A higher value results
        in a stronger repulsion between the agent and the pedestrians; a lower
        value results in a weaker repulsion. Typical expected values vary
        from 1 to 5. (default: 3)
```
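As a rough illustration of how an `alpha`-controlled potential encoding could work (this sketches the general idea only and is not the repo's `GravityEncoding` implementation):

```python
import numpy as np

def gravity_encoding(agent_pos, pedestrian_positions, alpha=3.0):
    """Sum of power-law contributions pointing from the agent toward each
    pedestrian; higher alpha sharpens the potential near the agent.
    Hypothetical sketch only -- not the repo's GravityEncoding."""
    deltas = pedestrian_positions - agent_pos              # shape (N, 2)
    dists = np.linalg.norm(deltas, axis=1, keepdims=True)  # shape (N, 1)
    return (deltas / (dists ** alpha + 1e-8)).sum(axis=0)  # shape (2,)
```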
## MODEL PARAMETERS:
```
optional subcommands (select the config of the model; default: model:type):
  [{model:sb-config,model:clean-rl-config,model:type}]
    model:sb-config           Stable Baselines Model Config
    model:clean-rl-config     Clean RL Model Config
    model:type
```
```
model.agent options (select the parameters of training the agent):
  --model.agent.exp-name STR
        the name of this experiment (default: rpo-agent)
  --model.agent.seed INT
        seed of the experiment (default: 1)
  --model.agent.torch-deterministic, --model.agent.no-torch-deterministic
        if toggled, `torch.backends.cudnn.deterministic=False` (default: True)
  --model.agent.cuda, --model.agent.no-cuda
        if toggled, cuda will be enabled by default (default: True)
  --model.agent.total-timesteps INT
        total timesteps of the experiments (default: 80000000)
  --model.agent.learning-rate FLOAT
        the learning rate of the optimizer (default: 0.0003)
  --model.agent.num-envs INT
        the number of parallel game environments (default: 3)
  --model.agent.num-steps INT
        the number of steps to run in each environment per policy rollout (default: 2048)
  --model.agent.anneal-lr, --model.agent.no-anneal-lr
        toggle learning rate annealing for policy and value networks (default: True)
  --model.agent.gamma FLOAT
        the discount factor gamma (default: 0.99)
  --model.agent.gae-lambda FLOAT
        the lambda for the general advantage estimation (default: 0.95)
  --model.agent.num-minibatches INT
        the number of mini-batches (default: 32)
  --model.agent.update-epochs INT
        the K epochs to update the policy (default: 10)
  --model.agent.norm-adv, --model.agent.no-norm-adv
        toggles advantages normalization (default: True)
  --model.agent.clip-coef FLOAT
        the surrogate clipping coefficient (default: 0.2)
  --model.agent.clip-vloss, --model.agent.no-clip-vloss
        toggles whether or not to use a clipped loss for the value function,
        as per the paper (default: True)
  --model.agent.ent-coef FLOAT
        coefficient of the entropy (default: 0.0)
  --model.agent.vf-coef FLOAT
        coefficient of the value function (default: 0.5)
  --model.agent.max-grad-norm FLOAT
        the maximum norm for the gradient clipping (default: 0.5)
  --model.agent.target-kl {None}|FLOAT
        the target KL divergence threshold (default: None)
```
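Following the usual CleanRL convention that the rollout batch is `num_envs × num_steps`, the defaults above imply these sizes (assuming that convention holds here):

```python
# Defaults from the options above
num_envs, num_steps, num_minibatches = 3, 2048, 32
total_timesteps = 80_000_000

batch_size = num_envs * num_steps               # 6144 transitions per rollout
minibatch_size = batch_size // num_minibatches  # 192 transitions per mini-batch
num_updates = total_timesteps // batch_size     # 13020 policy updates overall
```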
```
subcommands (select the network params):
  {model.network:rpo-linear-network-config,model.network:rpo-transformer-embedding-config,model.network:rpo-deep-sets-embedding-config}
    model.network:rpo-linear-network-config
    model.network:rpo-transformer-embedding-config
          RPO agent network with transformer encoding
    model.network:rpo-deep-sets-embedding-config
          RPO agent network with deep sets encoding
```
Outputs are saved in the following directories / files:

- `saved_data/giff/` $\rightarrow$ episode trajectory in GIF
- `saved_data/png/` $\rightarrow$ episode trajectory in PNG
- `saved_data/models/` $\rightarrow$ trained models
- `saved_data/logs/` $\rightarrow$ `${exp_name}.txt`, log of episode trajectories
- `saved_data/tb_logs/` $\rightarrow$ tensorboard logs
- `saved_data/config/` $\rightarrow$ `${exp_name}.yaml`, config of the current experiment
- `wandb/` $\rightarrow$ wandb logs

Example of logging a conducted experiment