Metis

Metis is a minimalist library for training RL agents in PyTorch. It implements many common training algorithms, with a focus on actor-critic methods, including SAC, TD3, PPO, A2C, and VPG.

Why the name 'metis'?

The meaning is three-fold:

  1. The Greek word metis meant a quality that combined wisdom and cunning. Metis feels like an apt description for the goal of RL -- find a cunning way to gain wisdom about a particular environment.
  2. Metis was also a Titaness of Greek mythology, known as the embodiment of "prudence", "wisdom", or "wise counsel". Again, sounds like a good description for what RL aspires to be.
  3. "Metis" sounds vaguely similar to "meta", as in meta-learning. For those out there (which definitely includes myself) who need something simpler to remember.

NOTE: I've been told that "metis" is actually pronounced (mee-tis), which kind of squashes the third interpretation. But I imagine that some folks will still pronounce it (meh-tis) like I do.

Philosophy

There are lots of RL libraries out there. In my experience, many of them are unnecessarily complicated, which makes them a nightmare to use. Others are much nicer (e.g. OpenAI's spinningup), but they are not designed for general engineering applications -- they are not so easily "hackable". Metis was started as a personal project, with the goal of creating a general-purpose RL library that is easy to use and understand (as much as is possible for RL algorithms).

Guiding development goals, in order of importance:

  1. Usability
  2. Hackability
  3. Simplicity
  4. Efficiency

Organization

Motivated by goals (1) and (2) above, each training algorithm is completely independent of the others. Trainers do not inherit methods from any parent class, and each independently defines its own update and loss functions. At times this might seem wasteful, because significant amounts of code are repeated. We certainly could define (semi-)generic parent classes for on-policy and off-policy trainers (or even a single generic trainer class), which would make the code less redundant. In practice, however, RL algorithms are difficult to write in a completely agnostic way. We would need additional class methods to handle the differences between algorithms (e.g. number of critic networks, number of target networks, rules for updates), which would reduce the readability and hackability of the code.

For the above reasons, relatively few abstractions are used (e.g. parent classes, abstract methods), which makes the code as explicit as possible. I believe this makes it more usable for real-world applications. In reality, I expect users to extract the bits and pieces they need, and adapt them to new use cases. I'm not sure it would be possible to write an RL library generic enough for every use case -- or at the very least, I'm not clever enough to do it. As I tell myself many days, "Keep it simple, stupid."
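
To make that structure concrete, here is a rough sketch of the shape each trainer takes. It is illustrative only -- the class name, method signatures, and the assumption that actor(obs) returns a torch.distributions.Distribution are inventions for this example, not the actual metis code -- but it shows the idea of a trainer that owns its own loss, update rule, and training loop:

import torch


class TinyReinforceTrainer:
    """Illustrative sketch only (not the metis source). The loss, the update
    rule, and the training loop all live on the trainer itself, with no
    shared parent class to inherit from."""

    def __init__(self, env, lr: float = 1e-2, gamma: float = 0.99):
        self.env = env
        self.lr = lr
        self.gamma = gamma

    def loss(self, actor, observations, actions, returns):
        # The algorithm-specific loss is written out explicitly here.
        # Assumes `actor(obs)` returns a `torch.distributions.Distribution`.
        log_probs = actor(observations).log_prob(actions)
        return -(log_probs * returns).mean()

    def update(self, optimizer, actor, observations, actions, returns):
        # ...as is the update rule. Other algorithms would add critics,
        # target networks, etc. inside their own `update` methods.
        optimizer.zero_grad()
        self.loss(actor, observations, actions, returns).backward()
        optimizer.step()

    def train(self, actor, epochs: int = 250):
        optimizer = torch.optim.Adam(actor.parameters(), lr=self.lr)
        for _ in range(epochs):
            # Collect one episode from a discrete-action environment
            # (old-style gym API, matching the snippets below).
            obs, acts, rewards = [], [], []
            o, done = self.env.reset(), False
            while not done:
                o_t = torch.as_tensor(o, dtype=torch.float32)
                action = actor(o_t).sample()
                o, reward, done, _ = self.env.step(action.item())
                obs.append(o_t)
                acts.append(action)
                rewards.append(reward)
            # Discounted reward-to-go for each timestep.
            returns, running = [], 0.0
            for r in reversed(rewards):
                running = r + self.gamma * running
                returns.insert(0, running)
            self.update(
                optimizer,
                actor,
                torch.stack(obs),
                torch.stack(acts),
                torch.as_tensor(returns, dtype=torch.float32),
            )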

Getting Started

Metis tries to be as user-friendly as possible without reducing the hackability of the overall project. Training your first RL agent takes just a few lines of code:

import gym
from metis import agents
from metis.trainers import PPO

env = gym.make("Pendulum-v0")
# Create generic actor/critic modules for the given environment.
actor = agents.actor(env)
critic = agents.critic(env)
trainer = PPO(env)
trainer.train(actor, critic)

GPU execution is also supported out-of-the-box. Simply push your RL agents to the desired device, and the trainer will handle the rest.

actor.cuda()
critic.cuda()
trainer.train(actor, critic)
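
If you want the same script to run on machines with or without a GPU, the usual PyTorch device-selection pattern works as well (nothing here is metis-specific, since the agents behave like ordinary PyTorch modules):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
actor.to(device)
critic.to(device)
trainer.train(actor, critic)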

Training on multiple GPUs is only slightly more work. We use DataParallel from the PyTorch API to specify which devices to run on. Again, no changes are needed to the trainer object.

from torch.nn import DataParallel

# Assumes that two GPUs are available with device IDs: 0, 1
dp_actor = DataParallel(actor, device_ids=[0, 1])
dp_critic = DataParallel(critic, device_ids=[0, 1])
trainer.train(dp_actor, dp_critic)
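
Note that DataParallel falls back to using every visible GPU when device_ids is omitted, so the explicit list above is only needed to restrict which devices are used:

# Equivalent on a machine where GPUs 0 and 1 are the only visible devices.
dp_actor = DataParallel(actor)
dp_critic = DataParallel(critic)
trainer.train(dp_actor, dp_critic)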

In the future, we hope to also support distributed training. Although we could perform forward/backward passes in a distributed way using DistributedDataParallel, it wouldn't really help very much, because the training environment still would not be distributed. It's possible that the dask or ray libraries provide a simpler solution to this problem, but for now we'll just stick to single-machine training.

Finally, any policy that derives from metis.agents.Actor can be visualized using the metis.play method. A game window will open, and the agent will interact with the environment until a done flag is encountered.

from metis import play

play(env, actor)
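
Although it isn't shown above, the actors and critics are used as ordinary torch.nn.Module objects (that is how .cuda() and DataParallel work on them), so standard PyTorch checkpointing should apply:

import torch

# Save the trained policy's weights...
torch.save(actor.state_dict(), "actor.pt")

# ...and restore them later into an actor built for the same environment.
actor.load_state_dict(torch.load("actor.pt"))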

Examples

In addition to the code snippets above, several example training scripts are included in the examples folder. They are very minimal and don't involve any callbacks.
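
For reference, a complete script in that spirit looks roughly like the following -- it simply combines the calls shown above:

import gym
from metis import agents, play
from metis.trainers import PPO

env = gym.make("Pendulum-v0")
actor = agents.actor(env)
critic = agents.critic(env)

trainer = PPO(env)
trainer.train(actor, critic)

# Watch the trained agent interact with the environment.
play(env, actor)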

Algorithms

Each algorithm is classified by whether it supports discrete and/or continuous action spaces, whether it is an actor-critic method, and whether it uses experience replay:

  • SAC: Soft Actor-Critic
  • TD3: Twin-Delayed Deep Deterministic Policy Gradients
  • DDPG: Deep Deterministic Policy Gradients
  • PPO: Proximal Policy Optimization
  • A2C: Advantage Actor-Critic
  • VPG: Vanilla Policy Gradients
  • DQN: Deep Q-Network
  • DDQN: Double Deep Q-Network

NOTE: The definition of VPG here differs from some other libraries. Others (e.g. spinup) define VPG using both an actor and a critic network, which is equivalent to A2C in this repository. Our VPG does not use a critic network, which aligns more closely with the original REINFORCE paper.
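
Concretely, both methods follow the same policy-gradient template; the difference is the weighting term. A rough statement of the distinction:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]$$

where VPG/REINFORCE uses the discounted reward-to-go, $\Psi_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$, while A2C replaces it with an advantage estimate $\Psi_t \approx R_t - V_\phi(s_t)$ computed from a learned critic $V_\phi$.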

Future Work

  • Add documentation on how to modify existing trainers or build your own
  • Add more callback functions for logging training info, early stopping, etc.
  • Add documentation to README on defining custom actors/critics