PPO-Humanoid

PPO implementation for controlling a humanoid in Gymnasium's Mujoco environment, featuring customizable training scripts and multi-environment parallel training.

This repository contains the implementation of a Proximal Policy Optimization (PPO) agent that controls a humanoid in the Gymnasium Mujoco environment (the maintained successor to OpenAI Gym). The agent is trained to master complex humanoid locomotion using deep reinforcement learning.


Results

Demo Gif

Here is a demonstration of the agent's performance after training for 3000 epochs on the Humanoid-v4 environment.


Installation

To get started with this project, follow these steps:

  1. Clone the Repository:

    git clone https://github.com/ProfessorNova/PPO-Humanoid.git
    cd PPO-Humanoid
  2. Set Up Python Environment: Make sure you have Python installed (tested with Python 3.10.11).

  3. Install Dependencies: Run the following command to install the required packages:

    pip install -r requirements.txt

    For proper PyTorch installation, visit pytorch.org and follow the instructions based on your system configuration.

  4. Install Gymnasium Mujoco: You need to install the Mujoco environment to simulate the humanoid (a quick way to verify this install is shown after these steps). Quote the extras specifier so the brackets survive in all shells:

    pip install "gymnasium[mujoco]"
  5. Train the Model: To start training the model, run:

    python train.py --run-name "my_run"

    To train using a GPU, add the --cuda flag:

    python train.py --run-name "my_run" --cuda
  6. Monitor Training Progress: You can monitor the training progress by viewing the videos in the videos folder or by looking at the graphs in TensorBoard:

    tensorboard --logdir "logs"
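
Once everything is installed, a quick sanity check (a throwaway snippet, not part of the repository) is to create the Humanoid-v4 environment and step it once with a random action:

    import gymnasium as gym

    # Create the Humanoid-v4 environment to confirm the Mujoco install works.
    env = gym.make("Humanoid-v4")
    obs, info = env.reset(seed=0)
    print("observation shape:", obs.shape)           # (376,) for Humanoid-v4
    print("action shape:", env.action_space.shape)   # (17,)

    # Take one random step and clean up.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    env.close()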

Description

Overview

This project implements a reinforcement learning agent using the Proximal Policy Optimization (PPO) algorithm, a popular method for continuous control tasks. The agent is designed to learn how to control a humanoid robot in a simulated environment.
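
At the heart of PPO is the clipped surrogate objective, which limits how far each update can move the policy away from the one that collected the data. A minimal sketch of that loss, combined with value and entropy terms as the hyperparameters below suggest, might look like the following (the function and tensor names are illustrative, not taken from train.py):

    import torch
    import torch.nn.functional as F

    def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
                 entropy, clip_ratio=0.1, vf_coef=1.0, ent_coef=1e-5):
        """Clipped PPO surrogate loss with value and entropy terms (illustrative)."""
        # Probability ratio between the new and the old policy.
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Clipped surrogate objective: take the pessimistic (minimum) estimate.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()

        # Value-function regression towards the empirical returns.
        value_loss = F.mse_loss(values, returns)

        # Entropy bonus encourages exploration (subtracted because we minimize).
        return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()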

Key Components

  • Agent: The core neural network model that outputs both the policy (a distribution over actions) and value estimates. A rough sketch of the agent and of the buffer's advantage calculation follows this list.
  • Environment: The Humanoid-v4 environment from the Gymnasium Mujoco suite, which provides a realistic physics simulation for testing control algorithms.
  • Buffer: A class for storing trajectories (observations, actions, rewards, etc.) that the agent collects during interaction with the environment. This data is later used to calculate advantages and train the model.
  • Training Script: The train.py script handles the training loop, including collecting data, updating the model, and logging results.
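
As a rough illustration of how the agent and the buffer tend to fit together (a generic sketch under common PPO conventions, not the actual code from this repository), the agent can be a small actor-critic MLP with a Gaussian policy head, and the buffer's advantage calculation is typically Generalized Advantage Estimation (GAE), matching the gamma and gae_lambda values listed under Hyperparameters:

    import torch
    import torch.nn as nn

    class ActorCritic(nn.Module):
        """Actor-critic for continuous control (illustrative sketch)."""

        def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
            super().__init__()
            self.policy = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, act_dim),          # mean of the Gaussian policy
            )
            self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std
            self.value = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),                # scalar state-value estimate
            )

        def forward(self, obs: torch.Tensor):
            dist = torch.distributions.Normal(self.policy(obs), self.log_std.exp())
            return dist, self.value(obs).squeeze(-1)

    def compute_gae(rewards, values, dones, last_value, gamma=0.995, gae_lambda=0.98):
        """Generalized Advantage Estimation over one rollout (shapes: [n_steps, n_envs])."""
        advantages = torch.zeros_like(rewards)
        gae = torch.zeros_like(last_value)
        for t in reversed(range(rewards.shape[0])):
            next_value = last_value if t == rewards.shape[0] - 1 else values[t + 1]
            next_nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
            gae = delta + gamma * gae_lambda * next_nonterminal * gae
            advantages[t] = gae
        returns = advantages + values
        return advantages, returns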

Usage

Training

You can customize the training by modifying the command-line arguments:

  • --n-envs: Number of environments to run in parallel (default: 48).
  • --n-epochs: Number of epochs to train the model (default: 3000).
  • --n-steps: Number of steps per environment per epoch (default: 1024).
  • --batch-size: Batch size for training (default: 8192).
  • --train-iters: Number of training iterations per epoch (default: 20).

For example:

python train.py --run-name "experiment_1" --n-envs 64 --batch-size 4096 --train-iters 30 --cuda

All hyperparameters can be viewed either with python train.py --help or by looking at the parse_args() function in train.py.
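
For reference, flags like these are typically wired up with argparse; the sketch below is an assumed, partial reconstruction (the authoritative definitions are in parse_args() in train.py):

    import argparse

    def parse_args():
        # Illustrative subset of the flags described above; see train.py for the full list.
        parser = argparse.ArgumentParser(description="PPO training on Humanoid-v4")
        parser.add_argument("--run-name", type=str, help="Name used for logs, videos, and checkpoints")
        parser.add_argument("--cuda", action="store_true", help="Train on the GPU if available")
        parser.add_argument("--n-envs", type=int, default=48, help="Parallel environments")
        parser.add_argument("--n-epochs", type=int, default=3000, help="Training epochs")
        parser.add_argument("--n-steps", type=int, default=1024, help="Steps per environment per epoch")
        parser.add_argument("--batch-size", type=int, default=8192, help="Minibatch size for updates")
        parser.add_argument("--train-iters", type=int, default=20, help="Training iterations per epoch")
        return parser.parse_args()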


Performance

Here are the specifications of the system used for training:

  • CPU: AMD Ryzen 9 5900X
  • GPU: Nvidia RTX 3080 (12GB VRAM)
  • RAM: 64GB DDR4
  • OS: Windows 11

The training process took about 5 hours to complete 3000 epochs on the Humanoid-v4 environment.

Hyperparameters

The hyperparameters used for training are as follows:

param                  value
-------------------    -----------
run_name               baseline
cuda                   True
env                    Humanoid-v4
n_envs                 48
n_epochs               3000
n_steps                1024
batch_size             8192
train_iters            20
gamma                  0.995
gae_lambda             0.98
clip_ratio             0.1
ent_coef               1e-05
vf_coef                1.0
learning_rate          0.0003
learning_rate_decay    0.999
max_grad_norm          1.0
reward_scale           0.005
render_epoch           50
save_epoch             200
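
Two of these values deserve a brief note. Interpreting learning_rate_decay as a per-epoch multiplicative factor and reward_scale as a multiplier applied to the raw environment reward before it is stored in the buffer (both are assumptions about train.py, not confirmed by it), they would be applied roughly like this:

    import torch

    # Placeholder model and optimizer; the real ones live in train.py.
    model = torch.nn.Linear(376, 17)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

    # learning_rate_decay = 0.999 interpreted as per-epoch exponential decay.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

    reward_scale = 0.005  # assumed: raw rewards are multiplied by this before storage

    for epoch in range(3000):
        # ... collect rollouts; store reward_scale * reward in the buffer
        # ... run train_iters PPO updates on the collected batch
        scheduler.step()  # decay the learning rate once per epoch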

Statistics

Performance Metrics:

The following charts provide insights into the performance during training:

  • Reward

    As seen in the chart, the agent's average reward is still increasing after 3000 epochs, indicating that the agent has not yet reached its full potential and could benefit from further training.

  • Policy Loss

  • Value Loss

    In the chart above, the value loss first increases and then decreases until it plateaus after 100M steps. This behavior is expected as the agent first explores the environment and then learns to predict the value of states more accurately.

  • Entropy Loss