Collective Reinforcement Learning Dynamics in Python
is a tool to model the collective dynamics emerging from multi-agent reinforcement learning.
Multi-agent reinforcement learning (MARL) provides a comprehensive framework for studying the interplay among learning, representation, and decision-making between multiple actors. As a result, it offers an integrating platform to in-silico test hypotheses and build theory on how different cognitive mechanisms affect collective adaptive behavior in complex environments.
In combination with advances in machine learning, particularly deep learning, modern MARL has produced spectacular successes in high-dimensional environments. However, standard RL simulations have significant disadvantages for modeling the collective behavior emerging from MARL: they are noisy, sometimes hard to explain, sample-inefficient, and computationally intense.
CRLD offers a solution in two ways of idealization. First, CRLD aims to understand the principles behind collective behavior in idealized, low-dimensional environments. Second, CRLD concentrates on the essence of the stochastic and computationally intense reinforcement learning algorithms by deriving their strategy-average, deterministic learning equations.
In a nutshell, reinforcement learning agents strive to improve the
rewards they receive while interacting with the environment. In each
time step, they asses a sample of their current reward-prediction
error
The key idea of CRLD is to replace the individual sample realizations with its strategy average plus a small error term,
One can interpret these learning dynamics from a cognitive and an
engineering perspective. In the limit of a vanishing error term,
How to put these ideas into practice?
First, let’s install the package from github:
pip install "git+https://github.com/wbarfuss/pyCRLD.git"
Second, we create a minimal example of a phase space portrait of the learning dynamics in a classic social dilemma environment:
from pyCRLD.Agents.StrategyActorCritic import stratAC
from pyCRLD.Environments.SocialDilemma import SocialDilemma
from pyCRLD.Utils import FlowPlot as fp
import numpy as np
import matplotlib.pyplot as plt
# Init enviornment and MultiAgentEnvironment-interface
env = SocialDilemma(R=1.0, T=0.8, S=-0.5, P=0.0)
mae = stratAC(env=env, learning_rates=0.1, discount_factors=0.9)
# Compute learning trajectory
x = mae.random_softmax_strategy() # from a random inital strategy
xtraj, fixedpointreached = mae.trajectory(x)
# PLOT
fig, axs = plt.subplots(1,2, figsize=(9,4))
plt.subplots_adjust(wspace=0.3)
# Plot in phase space
x = ([0], [0], [0])
y = ([1], [0], [0])
ax = fp.plot_strategy_flow(mae, x, y, flowarrow_points = np.linspace(0.01 ,0.99, 9), axes=[axs[0]])
fp.plot_trajectories([xtraj], x, y, cols=['purple'], axes=ax);
ax[0].set_xlabel("Agent 0's cooperation probability")
ax[0].set_ylabel("Agent 1's cooperation probability");
ax[0].set_title("Flowplot")
# Plot in trajectory
axs[1].plot(xtraj[:, 0, 0, 0], label="Agent 0", c='red')
axs[1].plot(xtraj[:, 1, 0, 0], label="Agent 1", c='blue')
axs[1].set_xlabel('Time steps')
axs[1].set_ylabel('Cooperation probability')
axs[1].legend()
axs[1].set_title("Trajectory");
We see how the learning trajectories on the right result from the flow on the left, which suggests that in this environment, mutual cooperation and mutual defection are viable solutions, depending on the initial cooperation levels of both agents.