This is a reimplementation of Trajectory Transformer, introduced in the paper "Offline Reinforcement Learning as One Big Sequence Modeling Problem".
The original implementation has a few problems with inference speed, namely quadratic attention during inference and sequential rollouts. The former slows down planning a lot, while the latter makes it impossible to run rollouts in parallel and fully utilize the GPU.
Even after all the changes, it is still not that fast compared to traditional methods such as PPO or SAC/DDPG. However, the gains are huge: what used to take hours now takes about a dozen minutes (for example, 25 rollouts of 1k steps each). Training time remains the same, though.
During beam search we only predict one token at a time. With the naive implementation, the model does a lot of unnecessary work recomputing attention maps for the full past context at every step. This is not needed, because the keys and values for past tokens were already computed when the previous token was predicted. All we need to do is cache them!
Attention caching is a common technique in NLP, but a lot of RL practitioners may not be familiar with it, so the code can also be educational.
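To make the idea concrete, here is a minimal pure-Python sketch of attention caching. It is not the code from this repo: `attend`, `causal_attention_full` and `KVCache` are hypothetical names, there is a single head, and queries, keys and values are the raw token vectors (identity projections) to keep it short.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def causal_attention_full(queries, keys, values):
    """Naive path: recompute every row of the causal attention map."""
    return [
        attend(queries[t], keys[: t + 1], values[: t + 1])
        for t in range(len(queries))
    ]


class KVCache:
    """Hypothetical cache holding keys and values of the tokens seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the new token's key/value are added; past ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy sequence of 6 tokens with 3 features each (assumption: q = k = v = token).
tokens = [[0.1 * (t + 1), -0.2 * t, 0.05 * t * t] for t in range(6)]

cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in tokens]
full_outputs = causal_attention_full(tokens, tokens, tokens)
```

Both paths give the same outputs, but with the cache each new token costs one attention row over the stored keys instead of a full recomputation of all rows. In a real transformer the cache would be kept per layer and per head, and would hold the projected keys and values.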
Vectorized environments make it possible to batch beam search planning and select actions for all environments in parallel, which is a lot faster if you need to evaluate the agent on a number of episodes (or seeds) during training.
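A rough sketch of the batching pattern, under loud assumptions: `ToyEnv` is a made-up environment, and `plan_batch` is only a stand-in for a single batched beam search call (in practice one forward pass on the GPU for all environments). None of these names come from this repo.

```python
class ToyEnv:
    """Hypothetical 1-D environment: the state is a step counter."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        reward = float(action)
        done = self.state >= self.horizon
        return self.state, reward, done


def plan_batch(observations):
    """Stand-in for one batched planner call: one action per observation."""
    return [1 if obs % 2 == 0 else 0 for obs in observations]


def evaluate(envs):
    """Step all environments in lockstep with one planner call per step."""
    observations = [env.reset() for env in envs]
    returns = [0.0] * len(envs)
    done = [False] * len(envs)
    planner_calls = 0
    while not all(done):
        actions = plan_batch(observations)  # one call covers every environment
        planner_calls += 1
        for i, env in enumerate(envs):
            if done[i]:
                continue  # finished episodes are skipped, their action is ignored
            observations[i], reward, done[i] = env.step(actions[i])
            returns[i] += reward
    return returns, planner_calls


returns, planner_calls = evaluate([ToyEnv(horizon=4), ToyEnv(horizon=6)])
```

Here the planner is called 6 times (the length of the longest episode), while evaluating the two episodes one after another would need 4 + 6 = 10 calls. The saving grows with the number of environments, as long as the batched call itself is not much slower than a single one.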
I trained it on the D4RL medium datasets to validate that everything works. Scores seem to be very close to the original. Pretrained models are available.
All training parameters can be found in the training configs. All datasets for the D4RL Gym tasks are supported.
python scripts/train.py --config="configs/medium/halfcheetah_medium" --device="cuda" --seed="42"
Available evaluation parameters are listed in the validation config. The parameters here are set to match the evaluation configs from the original implementation by @jannerm.
# you can override any config value from the command line
# halfcheetah-medium-v2
python scripts/eval.py \
--config="configs/eval_base.yaml" --device="cuda" --seed="42" \
checkpoints_path="pretrained/halfcheetah" \
beam_context=5 \
beam_steps=5 \
beam_width=32
# hopper-medium-v2
python scripts/eval.py \
--config="configs/eval_base.yaml" --device="cuda" --seed="42" \
checkpoints_path="pretrained/hopper" \
beam_context=5 \
beam_steps=15 \
beam_width=128
# walker2d-medium-v2
python scripts/eval.py \
--config="configs/eval_base.yaml" --device="cuda" --seed="42" \
checkpoints_path="pretrained/walker2d" \
beam_context=5 \
beam_steps=15 \
beam_width=128
@inproceedings{janner2021sequence,
title = {Offline Reinforcement Learning as One Big Sequence Modeling Problem},
author = {Michael Janner and Qiyang Li and Sergey Levine},
booktitle = {Advances in Neural Information Processing Systems},
year = {2021},
}