Self-driving bots for the Robotini racing simulator with deep reinforcement learning (RL). The deep RL algorithm used to train these bots is deep (recurrent) deterministic policy gradients (DDPG/RDPG) from the TensorFlow Agents framework.
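For orientation, constructing such an agent with TF-Agents looks roughly like the sketch below. Everything in it is illustrative: the observation and action specs, layer sizes, and hyperparameters are assumptions for this example rather than the values this project uses (those live in ./scripts/config.yml), and the recurrent RDPG variant would typically use the RNN actor/critic networks instead.

```python
import tensorflow as tf
from tf_agents.agents.ddpg import actor_network, critic_network, ddpg_agent
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Illustrative specs: a flattened 8-pixel RGB observation (8 * 3 = 24 values)
# and a 2-dimensional continuous action (e.g. turn and throttle), both assumed.
observation_spec = tf.TensorSpec(shape=(24,), dtype=tf.float32, name="observation")
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(2,), dtype=tf.float32, minimum=-1.0, maximum=1.0, name="action")
time_step_spec = ts.time_step_spec(observation_spec)

actor_net = actor_network.ActorNetwork(
    observation_spec, action_spec, fc_layer_params=(64, 64))
critic_net = critic_network.CriticNetwork(
    (observation_spec, action_spec), joint_fc_layer_params=(64, 64))

agent = ddpg_agent.DdpgAgent(
    time_step_spec,
    action_spec,
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    critic_optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    ou_stddev=0.2,        # Ornstein-Uhlenbeck exploration noise scale
    ou_damping=0.15,
    target_update_tau=0.05,
    gamma=0.99,
)
agent.initialize()
```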
Simulator socket connection and camera frame processing code is partially based on the codecamp-2021-robotini-car-template
project (not (yet?) public).
However, this project moves all socket communication to separate OS processes and implements message passing with concurrent queues.
This is to avoid having the TensorFlow Python API runtime compete for thread time with the socket communication logic (because of Python's global interpreter lock, the GIL).
All of the custom multiprocessing code could probably be replaced with a few lines using an external library (Celery, perhaps?).
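As a rough illustration of that pattern (not the project's actual code; the host, port, and framing below are placeholders), the socket reader runs in its own OS process and hands data to the trainer through a queue:

```python
import multiprocessing as mp
import queue
import socket

def socket_reader(host, port, frames):
    """Runs in a separate OS process: reads raw data from the simulator socket
    and pushes it onto a queue, so it never shares a GIL with TensorFlow."""
    with socket.create_connection((host, port)) as sock:
        while True:
            data = sock.recv(65536)  # frame parsing omitted in this sketch
            if not data:
                break
            try:
                frames.put_nowait(data)
            except queue.Full:
                pass  # drop data rather than block the reader

if __name__ == "__main__":
    frames = mp.Queue(maxsize=128)
    # Placeholder address; the real simulator address is configured via
    # SIMULATOR_HOST in the Makefile.
    reader = mp.Process(target=socket_reader,
                        args=("localhost", 11000, frames), daemon=True)
    reader.start()
    latest = frames.get()  # the training side consumes from the queue
```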
Screen recordings from a 20-hour training session.
At the beginning, the agent has no policy for choosing actions based on inputs:
eval-step0.mp4
After 2 hours of training, the agent has learned to avoid the walls but is still making mistakes:
eval-step2000.mp4
After 20 hours of training, the agent is driving somewhat ok:
eval-step16000.mp4
TensorBoard provides visualizations for all metrics (only 2 shown here):
Here we can see how the average episode length (number of steps from spawn to a crash or a full lap) and the average return (discounted sum of rewards over one episode) improve over time. The horizontal axis is the number of training steps.
Looking at the metrics at step 16k, we can see a global minimum of approximately 400 average episode length and 20 average return during evaluation.
Therefore, the agent driving the car env7_step16000_eval,
seen in the last video above (eval-step16000.mp4), might not be the optimal choice for a trained agent.
This repository contains an agent from training step 11800.
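As a reminder of what those two curves measure, the per-episode metrics boil down to something like this (illustrative only; the discount factor below is a placeholder, not the project's setting):

```python
def episode_metrics(rewards, gamma=0.99):
    """Episode length: number of steps from spawn to a crash or a full lap.
    Return: discounted sum of the per-step rewards of one episode."""
    episode_length = len(rewards)
    episode_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    return episode_length, episode_return

# e.g. a 3-step episode with rewards 1.0, 0.5, 2.0:
print(episode_metrics([1.0, 0.5, 2.0]))  # (3, ~3.46)
```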
web-ui.mp4
The neural network sees 8 RGB pixels as input at each time step; this input is called the observation. The pixels are computed by averaging 4 non-overlapping row groups and 4 non-overlapping column groups of the input camera frame:
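A rough sketch of that preprocessing, assuming the camera frame arrives as an HxWx3 NumPy array (the exact band boundaries and their ordering in the project may differ):

```python
import numpy as np

def frame_to_observation(frame: np.ndarray) -> np.ndarray:
    """Reduce an HxWx3 RGB frame to 8 mean RGB pixels: 4 row bands + 4 column bands."""
    rows = np.array_split(frame, 4, axis=0)  # 4 non-overlapping row groups
    cols = np.array_split(frame, 4, axis=1)  # 4 non-overlapping column groups
    row_means = [band.reshape(-1, 3).mean(axis=0) for band in rows]
    col_means = [band.reshape(-1, 3).mean(axis=0) for band in cols]
    return np.stack(row_means + col_means)   # shape (8, 3)
```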
Experience gathering is done in parallel:
explore-step16000.mp4
- Linux or macOS
- Robotini simulator - fork that exports additional car data over the logging socket.
- Unity - for running the simulator
- Redis - fast in-memory key-value store, used as a concurrent cache for storing car state (a usage sketch follows after this list).
- Python >= 3.6 + all deps from
requirements.txt
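To illustrate how Redis can serve as the concurrent car-state cache mentioned above, here is a minimal sketch using redis-py; the key names and state fields are made up for this example and are not the project's actual schema:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def write_car_state(car_id: str, state: dict) -> None:
    # Any process (socket reader, trainer, web UI) can update the shared state.
    r.set(f"car_state:{car_id}", json.dumps(state))

def read_car_state(car_id: str) -> dict:
    raw = r.get(f"car_state:{car_id}")
    return json.loads(raw) if raw is not None else {}

write_car_state("car1", {"speed": 1.2, "position": [0.5, 0.0, 3.1]})
print(read_car_state("car1"))
```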
Training a new agent can take several hours even on fast hardware due to the slow, real-time data collection. There are no guarantees that an agent trained on one track will work on other tracks.
If you still want to try training a new agent from scratch, here's how to get started:
- Start the simulator (use the fork) using this config file.
- If using a single machine for both training and running the simulator, set
SIMULATOR_HOST := localhost
in the Makefile. Otherwise, set it to the address of the machine running the simulator.
- On the training machine, run these commands in separate terminals:
make start_redis
make train_ddpg
make start_tensorboard
make start_monitor
- Open two tabs in a browser and go to:
  - localhost:8080 - live feed
  - localhost:6006 - TensorBoard
- ./tf-data contains TensorBoard logging data, saved agents/policies, and checkpoints, i.e. all training state. It can (and should) be deleted between unrelated training runs. Note: if a long training session produced a good agent, remember to back it up from ./tf-data/saved_policies before deleting. Perhaps even back up the whole tf-data directory.
- Interrupting or terminating the training script may (rarely) leave processes running. Check e.g. with ps ax | grep 'scripts/train_ddpg.py' or pgrep -f train_ddpg.py and clean up unwanted processes with e.g. pkill -ef 'python3.9 scripts/train_ddpg.py' (adjust the pattern according to the ps ax output).
- Every N training steps, the complete training state is written to ./tf-data/checkpoints. This makes the training state persistent and allows you to continue training from the saved checkpoint. Note that this includes the full, pre-allocated replay buffer, which can make each checkpoint up to 2 GiB in size. (A checkpointing sketch follows after this list.)
- For end-to-end debugging of the whole system, try make run_debug_policy. This runs a trivial hand-written policy that computes the amount of turn from RGB colors (see robotini_ddpg.features.compute_turn_from_color_mass; a hypothetical sketch of such a heuristic follows after this list).
- If the training log fills up with "camera frame buffer is empty, skipping step" warnings, the TensorFlow environment step loop is processing frames faster than they are read from the simulator over the socket. Known causes are that the simulator is under heavy load or that the machine is sleeping. Also, the simulator dumps spectator logs to race.log, which can grow large if the simulator runs for a long time.
- I have no idea if the hyperparameters in ./scripts/config.yml are optimal. Some tweaking could perhaps improve the results and training speed drastically.
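For the checkpointing note above: with TF-Agents this is typically done via tf_agents.utils.common.Checkpointer. A minimal sketch, assuming agent and replay_buffer objects like those in the DDPG sketch near the top of this README; the exact objects tracked by scripts/train_ddpg.py may differ.

```python
import tensorflow as tf
from tf_agents.utils import common

# `agent` and `replay_buffer` are assumed to exist already (see the DDPG
# sketch near the top of this README for one way to build the agent).
global_step = tf.compat.v1.train.get_or_create_global_step()

checkpointer = common.Checkpointer(
    ckpt_dir="./tf-data/checkpoints",
    max_to_keep=3,
    agent=agent,
    policy=agent.policy,
    replay_buffer=replay_buffer,  # pre-allocated, hence checkpoints of up to ~2 GiB
    global_step=global_step,
)
checkpointer.initialize_or_restore()  # resume from the latest checkpoint, if any

# ...then, every N training steps inside the training loop:
checkpointer.save(global_step)
```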
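To give an idea of what such a hand-written debug policy can look like, here is a hypothetical color-mass heuristic; the real robotini_ddpg.features.compute_turn_from_color_mass may compute the turn differently.

```python
import numpy as np

def compute_turn_sketch(frame: np.ndarray) -> float:
    """frame: HxWx3 RGB camera image; returns a turn value in [-1, 1].

    Steers away from the half of the frame with more color mass, on the
    (assumed) grounds that nearby walls dominate one side of the image.
    """
    h, w, _ = frame.shape
    left_mass = frame[:, : w // 2].astype(float).sum()
    right_mass = frame[:, w // 2 :].astype(float).sum()
    total = left_mass + right_mass + 1e-8
    return float((left_mass - right_mass) / total)
```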