PyTorch implementation of Deep Deterministic Policy Gradients (DDPG)

Deep Deterministic Policy Gradients (DDPG)

Overview

This repository contains an implementation of the Deep Deterministic Policy Gradients (DDPG) algorithm, as described in the paper "Continuous control with deep reinforcement learning" by Lillicrap et al., evaluated on standard continuous control environments from the Gymnasium and MuJoCo libraries. DDPG is a model-free, actor-critic algorithm tailored to continuous action domains. Building on the deterministic policy gradient (DPG) framework, it adapts two techniques from Deep Q-Networks (DQN), experience replay and target networks, to stabilize training in high-dimensional, continuous action spaces. The authors also incorporate batch normalization in their networks to manage the differing scales of input features. This implementation uses PyTorch's LayerNorm instead, because it does not depend on batch size and allows for a cleaner implementation of the target network parameter updates.
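The two DQN-derived techniques mentioned above can be illustrated with a minimal, framework-free sketch. This is not the code from this repository: the names `ReplayBuffer`, `soft_update`, and `tau` are hypothetical, and parameters are represented as plain lists of floats rather than PyTorch tensors so that the arithmetic is easy to follow.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions.

    Hypothetical helper for illustration; not the repository's own class.
    """

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement decorrelates consecutive steps.
        batch = random.sample(self.memory, batch_size)
        # Transpose the list of transitions into one tuple per field.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)


def soft_update(target_params, online_params, tau):
    """Blend online parameters into the target: tau * online + (1 - tau) * target.

    Parameters are plain floats here (an assumption made for readability);
    with tensors the same blend would be applied to each parameter.
    """
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

With a small `tau`, the target parameters trail the online parameters slowly, which is what keeps the critic's learning targets from shifting abruptly between updates.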

Setup

Required Dependencies

Install the required dependencies using the following command:

pip install -r requirements.txt

Running the Algorithm

You can run the algorithm on any supported Gymnasium environment. For example:

python main.py --env 'LunarLanderContinuous-v2'

Pendulum-v1

LunarLanderContinuous-v2

MountainCarContinuous-v0

BipedalWalker-v3

Hopper-v4

Humanoid-v4

Ant-v4

HalfCheetah-v4

HumanoidStandup-v4

InvertedDoublePendulum-v4

InvertedPendulum-v4

Pusher-v4

Reacher-v4

Swimmer-v3

Walker2d-v4

No hyper-parameter tuning was conducted for these benchmarks. This was an intentional choice, made to compare the performance of the general algorithm across a variety of environments. As a result, there are several cases where the agent fails to learn effectively, and others where the agent was still learning after 10k epochs. DDPG is notably brittle: its performance is sensitive to starting conditions and hyper-parameter choices, a limitation addressed by later algorithms such as Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO).

Acknowledgements

Special thanks to Phil Tabor, an excellent teacher! I highly recommend his YouTube channel.