This repo contains the code to train deep reinforcement learning policies for HalfCheetah, Ant, and a 1/18th-scale autonomous car (DeepRacer).
Paper: Sim2Real Transfer for Deep Reinforcement Learning with Stochastic State Transition Delays, CoRL 2020.
Deployment heterogeneities and runtime compute stochasticity result in variable timing characteristics of sensor sampling rates and end-to-end delays from sensing to actuation. Prior works have used domain randomization to enable the successful transfer of policies across domains with different state transition delays. We show that variation in sampling rates and policy execution time degrades Deep RL policy performance, and that domain randomization is insufficient to overcome this limitation. We propose the Time-in-State RL (TS-RL) approach, which includes delays and sampling rate as additional agent observations at training time to improve the robustness of Deep RL policies. We demonstrate the efficacy of TS-RL on HalfCheetah, Ant, and a car robot in simulation, and on a real robot using a 1/18th-scale car.
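The core idea of TS-RL, making the timing characteristics part of the agent's observation, can be sketched with a Gym observation wrapper. This is a minimal illustration, not the repo's code; the wrapper name and the `measure_timing` hook (returning the current delay and sampling period) are assumptions for exposition:

```python
import numpy as np
import gym


class TimeInStateWrapper(gym.ObservationWrapper):
    """Hypothetical sketch: append [delay, sampling_period] to each
    observation so the policy can condition on timing (the TS-RL idea)."""

    def __init__(self, env, measure_timing):
        super().__init__(env)
        self._measure_timing = measure_timing  # returns (delay, sampling_period)
        # Extend the Box observation space by the two timing values.
        low = np.concatenate([env.observation_space.low, [0.0, 0.0]])
        high = np.concatenate([env.observation_space.high, [np.inf, np.inf]])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        delay, period = self._measure_timing()
        return np.concatenate([obs, [delay, period]]).astype(np.float32)
```

During training, `measure_timing` would return the randomized delay and sampling period injected into the simulator; at deployment it would return the values measured on the robot.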
[Demo GIFs: side-by-side DR (domain randomization) vs. TS (Time-in-State) policies for HalfCheetah, Ant, and DeepRacer]
Check out the quick demo of the transfer of policies from simulation to a real car robot.
a) The fully connected policy training code using PPO is taken from OpenAI Baselines, with modifications to allow variable timing characteristics during training by fusing the delay observations with the neural network inputs for images/state (see the fusion sketch after this list).
b) The code to train recurrent policies using PPO with variable timing characteristics is modified from the batch PPO code released by Google Research.
c) The HalfCheetah environment and robot are taken from the PyBullet code, with modifications for variable timing characteristics of the simulation steps (see the timing sketch after this list).
d) The Ant environment and robot are taken from the PyBullet code, with the same modifications for variable timing characteristics of the simulation steps.
e) The DeepRacer simulator is a modified snapshot of the open-source DeepRacer code available here. The changes include the track color, the captured camera sampling rate, and the removal of the AWS dependencies. The DeepRacer environment is also modified to allow variable timing characteristics.
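To make the fusion in (a) concrete, here is a minimal sketch of the idea in PyTorch (the repo itself builds on the TensorFlow-based OpenAI Baselines; the layer sizes, the 84x84 grayscale input, and the two-element timing vector are illustrative assumptions). The timing observations are concatenated with the image features before the fully connected head:

```python
import torch
import torch.nn as nn


class DelayFusionPolicy(nn.Module):
    """Hypothetical sketch: fuse timing observations with image features."""

    def __init__(self, num_actions, timing_dim=2):
        super().__init__()
        # Small CNN trunk for the camera image (assumed 1x84x84 input).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # 32 channels x 9 x 9 spatial output, plus the timing vector.
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9 + timing_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, image, timing):
        feat = self.cnn(image)
        # Concatenate timing characteristics with image features before the head.
        return self.head(torch.cat([feat, timing], dim=1))
```

For low-dimensional state inputs, the fusion reduces to concatenating the timing values with the state vector, as in the wrapper sketch above.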
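The variable timing characteristics in (c)-(e) amount to randomizing how much simulated time elapses between consecutive policy actions. A minimal sketch of one way to do this with a Gym wrapper follows; the wrapper name and the step-count range are assumptions, not the repo's interface:

```python
import numpy as np
import gym


class StochasticDelayWrapper(gym.Wrapper):
    """Hypothetical sketch: repeat the last action for a random number of
    simulator steps, emulating a stochastic sensing-to-actuation delay."""

    def __init__(self, env, min_steps=1, max_steps=4):
        super().__init__(env)
        self.min_steps = min_steps
        self.max_steps = max_steps

    def step(self, action):
        # Sample how many simulator steps this action stays in effect.
        n = np.random.randint(self.min_steps, self.max_steps + 1)
        total_reward = 0.0
        for _ in range(n):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```

The sampled delay is exactly what TS-RL would expose to the agent as an extra observation.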
1. Install the following requirements to train all the policies mentioned in the paper:
a) OpenAI Gym, OpenAI Baselines, batch-PPO
b) Gazebo and ROS for the DeepRacer robotic car policies
c) PyBullet for Ant and HalfCheetah
2. To train only the fully connected policies for Ant and HalfCheetah:
a) OpenAI Gym, OpenAI Baselines
b) PyBullet for Ant and HalfCheetah
3. To train only the recurrent policies for HalfCheetah:
a) OpenAI Gym, batch-PPO
b) PyBullet for HalfCheetah
4. To train only the policies for the DeepRacer robotic car:
a) OpenAI Gym, OpenAI Baselines
b) Gazebo and ROS for the DeepRacer robotic car policies
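Once the requirements are installed, a quick way to sanity-check the PyBullet setup is to step a random policy in one of the environments. This smoke test is a sketch, not part of the repo, and assumes the standard environment names registered by `pybullet_envs`:

```python
import gym
import pybullet_envs  # registers the *BulletEnv-v0 environments with Gym

# Smoke test: run one episode of HalfCheetah with random actions.
env = gym.make("HalfCheetahBulletEnv-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
print("episode return:", total_reward)
env.close()
```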
The training of policies, benchmarking, and visualization for each task are explained in the respective folders: 'deepracer', 'ant', 'halfcheetah', and 'halfcheetah-recurrent'.
For any help or issues running the code, please reach out to sandha.iitr@gmail.com