Partially Observable Gym, or `pogym`, is a collection of Partially Observable Markov Decision Process (POMDP) environments following the OpenAI Gym interface. The goal of `pogym` is to provide a standard benchmark for memory-based models in reinforcement learning. In other words, we want to provide a place for you to test and compare your models and algorithms against R2D2, recurrent PPO, decision transformers, and so on.
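Every environment exposes the standard Gym interface. A minimal random-agent loop is sketched below; the environment ID is a placeholder (the real names live in `ALL_ENVS` in `pogym/__init__.py`), and we assume the classic four-tuple `step` API:

```python
import gym

# Placeholder ID -- check ALL_ENVS in pogym/__init__.py for the actual names.
env = gym.make("pogym:RepeatPrevious-v0")

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a memory-based policy
    obs, reward, done, info = env.step(action)
env.close()
```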
`pogym` has a few basic tenets that we will adhere to:
- Painless setup - `pogym` requires only `gym` and `numpy` as dependencies, and can be installed with a single `pip install`.
- Laptop-sized tasks - None of our environments have large observation spaces or require GPUs to render.
- No overfitting - It is possible for memoryless agents to receive high rewards on environments by memorizing the layout of each level. To avoid this, all environments are procedurally generated.
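A quick way to see procedural generation at work: reset the same environment twice and the level changes, so a memoryless agent cannot simply memorize it. The environment ID below is again a placeholder:

```python
import gym

# Placeholder ID -- any pogym environment is regenerated on reset.
env = gym.make("pogym:BipedalWalker-v0")

first_layout_obs = env.reset()   # episode 1: one procedurally generated level
second_layout_obs = env.reset()  # episode 2: a freshly generated level
```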
The environments are split into set or sequence tasks. Ordering matters in sequence tasks (e.g. the order of the button presses in Simon matters), and does not matter in set tasks (e.g. the "count" in blackjack does not change if you swap observations `o_{t-1}` and `o_{t-k}`).
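To make the set/sequence distinction concrete (a standalone illustration, not pogym code):

```python
from collections import Counter

# Set task: a blackjack-style "count" is order-invariant.
assert Counter(["K", "3", "A"]) == Counter(["A", "K", "3"])

# Sequence task: Simon's button presses are order-sensitive.
assert ("red", "blue") != ("blue", "red")
```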
- Memory/Concentration (partially implemented)
- Blackjack
- Baccarat (not implemented yet)
- Higher/Lower
- Battleship (not implemented yet)
- Multiarmed Bandit
- Minesweeper (not implemented yet)
- Repeat Previous
- Repeat First
- Repeat Backwards
- Stateless Cartpole
- Stateless Pendulum
- Treasure Hunt (not implemented yet)
- Bipedal Walker
- Labyrinth Escape (not implemented yet)
- Labyrinth Explore (not implemented yet)
To contribute a new environment, follow these steps:
1. Fork this repo on GitHub
2. Clone your fork to your machine
3. Move your environment into the forked repo
4. Install pre-commit in the fork (see below)
5. Write a unit test in `tests/` (see the other tests for examples, and the sketch after the pre-commit instructions)
6. Add your environment to `ALL_ENVS` in `pogym/__init__.py`
7. Make sure you don't break any tests by running `pytest tests/`
8. Git commit and push to your fork
9. Open a pull request on GitHub
```bash
# Step 4. Install pre-commit in the fork
pip install pre-commit
git clone https://github.com/smorad/pogym
cd pogym
pre-commit install
```
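For step 5, a minimal test might look like the sketch below. The environment ID and assertions are placeholders; mirror the structure of the existing tests in `tests/`:

```python
import unittest

import gym


class TestMyEnv(unittest.TestCase):
    def test_step(self):
        # "pogym:MyEnv-v0" is a placeholder for the environment you are adding.
        env = gym.make("pogym:MyEnv-v0")
        obs = env.reset()
        self.assertTrue(env.observation_space.contains(obs))
        obs, reward, done, info = env.step(env.action_space.sample())
        self.assertTrue(env.observation_space.contains(obs))


if __name__ == "__main__":
    unittest.main()
```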
**Blackjack:** Casino blackjack, but unlike other blackjack environments, the game is not over after the hand is dealt. The game continues until the deck(s) of cards are exhausted. The agent should learn to maintain a "count" of the cards it has seen. Using memory, it can infer which cards remain in the deck and adjust its bet accordingly to maximize return.
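To make the "count" concrete, here is an illustrative Hi-Lo running count; this is not part of pogym, and a successful agent must learn something equivalent inside its memory:

```python
# Hi-Lo: low cards seen raise the count (the remaining deck now favors the
# player), high cards lower it, 7-9 are neutral. Illustration only.
def hi_lo_update(count: int, rank: int) -> int:
    if 2 <= rank <= 6:
        return count + 1
    if rank == 1 or rank >= 10:  # ace, ten, or face card
        return count - 1
    return count


count = 0
for rank in [5, 13, 2, 1, 9]:  # ranks seen so far (1 = ace, 13 = king)
    count = hi_lo_update(count, rank)
print(count)  # 0: two low cards, two high cards, one neutral
```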
**Baccarat:** Identical rules to casino baccarat, with betting. The agent should use memory to count cards and increase bets when it is more likely to win.
**Higher/Lower:** Guess whether the next card drawn from the deck is higher or lower than the previously drawn card. The agent should keep a count, as in blackjack and baccarat, and modify its bets, but this game is significantly simpler than either.
**Battleship:** One-player battleship. Select a grid square to launch an attack, and receive confirmation of whether you hit the target. The agent should use memory to remember which grid squares were hits and which were misses, completing the episode sooner.
**Multiarmed Bandit:** Over an episode, solve a multiarmed bandit problem by maximizing the expected reward. The agent should use memory to keep a running mean and variance for each arm.
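A running mean and variance can be kept in constant memory per arm with Welford's online update (an illustration, not pogym code):

```python
# Welford's algorithm for one bandit arm.
class RunningStats:
    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, reward: float) -> None:
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n > 0 else 0.0


stats = RunningStats()
for r in [1.0, 0.0, 1.0, 1.0]:
    stats.update(r)
print(stats.mean, stats.variance)  # 0.75 0.1875
```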
**Minesweeper:** Classic minesweeper, but with reduced vision range. The agent only has vision of the surroundings near its last sweep, and must use memory to remember where the bombs are.
**Repeat Previous:** Output the observation from k steps ago (`o_{t-k}`) for a reward.
**Repeat First:** Output the zeroth observation for a reward.
**Repeat Backwards:** The agent receives k observations, then must repeat them in reverse order.
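The ideal memory for Repeat Backwards is just a stack: push each observation, then pop to replay in reverse (illustration only; the agent must learn this behavior in its recurrent state):

```python
from collections import deque

observations = [3, 1, 4, 1, 5]  # the k observations received
stack = deque(observations)     # push in arrival order
replayed = [stack.pop() for _ in observations]  # pop = last in, first out
print(replayed)  # [5, 1, 4, 1, 3]
```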
**Stateless Cartpole:** Classic cartpole, except the velocity and angular velocity magnitudes are hidden. The agent must use memory to compute rates of change.
**Stateless Pendulum:** Classic pendulum, but the velocity and angular velocity are hidden from the agent. The agent must use memory to compute rates of change.
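In both stateless tasks, the hidden velocity is recoverable from two consecutive observations by a finite difference, which is what a memory-based agent must implicitly learn. The timestep below is hypothetical; the environments do not expose it:

```python
# Finite-difference velocity estimate from two consecutive positions.
def estimate_velocity(x_prev: float, x_curr: float, dt: float = 0.02) -> float:
    return (x_curr - x_prev) / dt


print(estimate_velocity(0.10, 0.12))  # ~1.0
```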
**Treasure Hunt:** The agent is placed in an open square and must search for a treasure. With memory, the agent can remember where it has been and complete the episode faster.
**Bipedal Walker:** Classic bipedal walker with procedurally generated levels, but with a single LiDAR ray cast from the head. The agent must move the head and combine individual rays over time into a representation of the environment to avoid obstacles.
**Labyrinth Escape:** Escape randomly generated labyrinths. The agent must remember the wrong turns it has taken to find the exit.
**Labyrinth Explore:** Explore as much of the labyrinth as possible in the time given. The agent must remember where it has been to maximize reward.