fetch-and-slide-HRE-PRE

In this project, I attempt to solve the Fetch Slide OpenAI Gym environment with Hindsight Experience Replay (HER), and then I experiment with Prioritised Experience Replay (PER) to see whether it improves performance.

Environments tested

  • FetchSlide-v1
  • FetchPickAndPlace-v1

Environment details

  • FetchSlide-v1

    • It is an OpenAI robotics environment that uses the MuJoCo physics engine to simulate the environment physics. The state space is continuous and the action space is 4-dimensional. The first three dimensions control where the robot hand moves in space, and the fourth controls the distance between the claws of the gripper. There are two sets of goals, Achieved Goal and Desired Goal. The achieved goal is the goal actually reached after performing an action in a given state. The desired goal is the goal we want the agent to reach after performing an action from a state.
    • The agent (a robot arm) tries to slide a puck towards a goal location on a table in front of it. The surface of the table has some friction as well. The main aim in this environment is for the agent to learn to slide the puck to its desired location.
  • FetchPickAndPlace-v1

    • It is an OpenAI robotics environment that uses the MuJoCo physics engine to simulate the environment physics. The state space is continuous and the action space is 4-dimensional. The first three dimensions control where the robot hand moves in space, and the fourth controls the distance between the claws of the gripper. There are two sets of goals, Achieved Goal and Desired Goal. The achieved goal is the goal actually reached after performing an action in a given state. The desired goal is the goal we want the agent to reach after performing an action from a state.
    • The agent (a robot arm) tries to pick up a block and place it at the goal location in front of it. The goal location can be either in the air above the table or on the table. The main aim in this environment is for the agent to learn to pick up the block and place it at the desired location.

Algorithms

  • For this project, I use Deep Deterministic Policy Gradient (DDPG) as the off-policy algorithm of choice. I then add HER and HER+PER on top of it.

Deep Deterministic Policy Gradient (DDPG)

DDPG algorithm

  • Deep Deterministic Policy Gradient (Lillicrap et al., 2015) is a model-free RL algorithm for continuous action spaces. In DDPG, we have a target policy (actor network) $\pi : S \to A$ and an action-value function approximator (critic network) $Q : S \times A \to \mathbb{R}$. The critic is responsible for approximating $Q^\pi$, the action-value function of the policy $\pi$ generated by the actor. Episodes are generated by adding a noise term to the target policy (here we experiment with Gaussian and Ornstein–Uhlenbeck (Wikipedia contributors, 2022) noise terms). The actor is trained with mini-batch gradient descent on the loss $-\mathbb{E}_s[Q(s, \pi(s))]$, where $s$ is sampled from the replay buffer.
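
The two exploration noise terms mentioned above could look roughly like the sketch below. It is not taken from this repository: the class names, the default values of sigma, theta and dt, and the action bounds of [-1, 1] are all assumptions chosen for illustration.

```python
import math
import random


class GaussianNoise:
    """Independent zero-mean Gaussian noise for each action dimension."""

    def __init__(self, size, sigma=0.2, seed=None):
        self.size = size
        self.sigma = sigma  # assumed default, not from the repository
        self.rng = random.Random(seed)

    def sample(self):
        return [self.rng.gauss(0.0, self.sigma) for _ in range(self.size)]


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise that drifts back towards mu."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.size, self.mu, self.theta = size, mu, theta
        self.sigma, self.dt = sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Start every episode from the long-run mean.
        self.state = [self.mu] * self.size

    def sample(self):
        # Euler-Maruyama step:
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def noisy_action(action, noise, low=-1.0, high=1.0):
    """Add exploration noise to the actor's output and clip to the action bounds."""
    return [min(high, max(low, a + n)) for a, n in zip(action, noise.sample())]
```

Gaussian noise is independent from step to step, while Ornstein–Uhlenbeck noise is correlated over time, so consecutive actions are pushed in similar directions.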

Hindsight Experience Replay (HER)

HER algorithm

  • With Hindsight Experience Replay, the trick is to consider that we want the agent to learn multiple goals instead of a single goal.

Multi goal RL

We can achieve this by using an approach analogous to Universal Value Function Approximators (Schaul et al.). Instead of storing just the states, we store state-goal pairs when we record the transitions, so the policies are a function of the state as well as the goal. We assume that there is a mapping from states to goals and that new goals can be sampled through this mapping.
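
As a minimal sketch of what "a function of state as well as goal" means in practice, the snippet below builds the network input by concatenating the two. The dictionary keys follow the goal-based observation layout used by Gym's robotics environments; the helper name `policy_input` and the toy values are hypothetical.

```python
def policy_input(obs):
    """Concatenate the state with the goal so the networks are conditioned on both."""
    return list(obs["observation"]) + list(obs["desired_goal"])


# Hypothetical observation with a 5-d state and a 3-d goal.
obs = {
    "observation": [0.1, 0.2, 0.3, 0.0, 0.0],
    "achieved_goal": [0.1, 0.2, 0.3],
    "desired_goal": [0.5, 0.5, 0.4],
}
x = policy_input(obs)  # 8 values: 5 from the state followed by 3 from the goal
```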

Hindsight Generation

After an episode ends (we reach the end of a trajectory of length $T$), we store every transition of the form $s_t \to s_{t+1}$ not only with the original goal but also with the intermediate goals achieved along the trajectory. To decide how much hindsight experience I need, I use a factor $k$ which decides the ratio of HER replay to standard replay. The probability of selecting HER replay is given by $P_{future} = 1 - \frac{1}{1+k}$.
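
The relabelling step can be sketched as follows. This is an illustration under stated assumptions rather than the code in this repository: it assumes the "future" strategy (the substitute goal is one achieved later in the same episode), a sparse reward of 0 on success and -1 otherwise, and a success threshold of 0.05. The function names and the episode layout are hypothetical.

```python
import random


def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Return 0 when the achieved goal is within `threshold` of the goal, else -1."""
    dist = sum((a - d) ** 2 for a, d in zip(achieved_goal, desired_goal)) ** 0.5
    return 0.0 if dist <= threshold else -1.0


def sample_her_transition(episode, k=4, rng=random):
    """Sample one transition and relabel its goal with probability 1 - 1/(1 + k).

    `episode` is a dict with:
      "achieved_goals": list of T + 1 goals (one per visited state)
      "desired_goal":   the goal the episode was collected with
    """
    T = len(episode["achieved_goals"]) - 1
    p_future = 1.0 - 1.0 / (1.0 + k)
    t = rng.randrange(T)  # index of the transition s_t -> s_{t+1}
    goal = episode["desired_goal"]
    if rng.random() < p_future:
        # Hindsight: use a goal that was actually achieved later in the episode.
        future_t = rng.randint(t + 1, T)
        goal = episode["achieved_goals"][future_t]
    # The reward is recomputed for whichever goal the transition now carries.
    reward = sparse_reward(episode["achieved_goals"][t + 1], goal)
    return t, goal, reward
```

With k = 4, four out of five sampled transitions on average carry a hindsight goal, and the rest keep the original one.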

Prioritised Experience Replay (PER)

PER algorithm

Prioritized Experience Replay applies a prior to the standard experience replay buffer so that the transitions that lead to higher learning progress are preferred.

Applying prioritization to the replay buffer

The TD error of a transition is a good measure of how surprising or unexpected the transition is, which makes it a proxy for the expected learning progress. However, prioritizing greedily by TD error alone can cause transitions that start with a very low TD error to never be replayed. It can also be very sensitive to noise when the rewards are stochastic. Finally, it may be prone to over-fitting, as the transitions with an initially high TD error will be replayed very frequently. So, following the implementation of Prioritized Experience Replay (Schaul et al., 2015), I use stochastic prioritization, which interpolates between pure greedy prioritization and uniform random sampling.
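
A minimal sketch of stochastic prioritization, following the proportional variant described by Schaul et al. (2015), is shown below. The function names and the values of alpha, beta and eps are assumptions for illustration, and the importance-sampling weights are part of the paper's method rather than something stated above. With alpha = 0 sampling is uniform; larger alpha moves it towards greedy prioritization.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha with priority p_i = |delta_i| + eps."""
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^-beta, normalised by the largest weight for stability."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


def sample_batch(td_errors, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Draw transition indices in proportion to priority, with their loss weights."""
    probs = sampling_probabilities(td_errors, alpha)
    indices = rng.choices(range(len(probs)), weights=probs, k=batch_size)
    weights = importance_weights(probs, beta)
    return indices, [weights[i] for i in indices]
```

The small eps keeps transitions with zero TD error from having zero probability, which addresses the "never replayed" problem described above.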

Network Architecture

There are two kinds of networks used here: an actor network and a critic network. The actor network generates a target policy, which is then used as an input for the critic network. The critic network evaluates the quality of the policy generated by the actor. Both the actor and the critic network use the Adam optimizer with a learning rate of 0.001.

Actor Network

The actor network consists of 3 fully connected layers, each activated using a ReLU activation function. The default number of units in the input and hidden layers is 256. This is followed by an action selection unit, which is a fully connected layer with dimensions (units in 3rd fully connected layer × number of actions). This output action selection layer is activated using a tanh activation function.

Critic Network

The critic network consists of 3 fully connected layers, each activated using a ReLU activation function. The default number of units in the input and hidden layers is 256. This is followed by a Q-value output unit, which is a fully connected layer with dimensions (units in 3rd fully connected layer × 1).
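
To make the layer dimensions concrete, the sketch below lists the weight shapes implied by the description above. It is only an illustration: the helper `mlp_layer_shapes` is hypothetical, the 25-d observation and 3-d goal sizes are assumed, and feeding the action into the critic's first layer is an assumption as well.

```python
def mlp_layer_shapes(input_dim, output_dim, hidden_units=256, n_hidden_layers=3):
    """Return (in_features, out_features) for each fully connected layer."""
    dims = [input_dim] + [hidden_units] * n_hidden_layers + [output_dim]
    return list(zip(dims[:-1], dims[1:]))


# Assumed sizes for illustration: 25-d observation, 3-d goal, 4-d action.
obs_dim, goal_dim, action_dim = 25, 3, 4

# Actor: (state, goal) -> action, with tanh on the output layer.
actor_shapes = mlp_layer_shapes(obs_dim + goal_dim, action_dim)

# Critic: (state, goal, action) -> a single Q-value.
critic_shapes = mlp_layer_shapes(obs_dim + goal_dim + action_dim, 1)

for name, shapes in (("actor", actor_shapes), ("critic", critic_shapes)):
    print(name, shapes)
```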

Target updates

The target networks are updated using a Polyak average (Polyak and Juditsky, 1992). After each update, the target networks’ parameters are made up of 95% of their own previous parameters and 5% of the parameters of the actor/critic networks.
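
The 95%/5% blend can be written as a short soft-update routine. This is a sketch over plain Python lists rather than the repository's network parameters; the function name `soft_update` is hypothetical.

```python
def soft_update(target_params, source_params, polyak=0.95):
    """Blend in place: target <- polyak * target + (1 - polyak) * source."""
    for i, (t, s) in enumerate(zip(target_params, source_params)):
        target_params[i] = polyak * t + (1.0 - polyak) * s


target = [1.0, 0.0]   # parameters of a target network
online = [0.0, 1.0]   # parameters of the matching actor/critic network
soft_update(target, online)
```

Because only 5% of the online parameters are mixed in at each update, the target networks change slowly, which keeps the critic's learning targets stable.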

Experiment Results

FetchSlide-v1 100 epochs playlist

Multi processing setup

  • I use the mpi4py module for message passing within Python, which lets me exploit multiple processors on my system.

  • mpiutils has two functions to synchronise data across multiple CPUs

    • sync_networks:

      This function synchronises a network's parameters across multiple CPUs. Broadcasting the parameters ensures that we can easily collect data from each copy of the network.

    • sync_grads:

      This function synchronises gradients across networks by reducing the flattened gradients across processes.
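
The idea behind reducing flat gradients can be illustrated without MPI. The sketch below simulates the reduction in a single process with plain lists; it does not call mpi4py, all function names are hypothetical, and using a sum (rather than an average) as the reduction is an assumption.

```python
def flatten(grads):
    """Concatenate per-layer gradients into one flat list."""
    return [g for layer in grads for g in layer]


def unflatten(flat, like):
    """Split a flat list back into layers shaped like `like`."""
    out, start = [], 0
    for layer in like:
        out.append(flat[start:start + len(layer)])
        start += len(layer)
    return out


def simulated_allreduce_sum(per_worker_flat):
    """Stand-in for an MPI sum-reduction: the element-wise sum over all workers."""
    return [sum(values) for values in zip(*per_worker_flat)]


# Two hypothetical workers, each holding gradients for two layers.
worker_grads = [
    [[1.0, 2.0], [3.0]],
    [[0.5, 0.5], [1.0]],
]
reduced = simulated_allreduce_sum([flatten(g) for g in worker_grads])
synced = unflatten(reduced, worker_grads[0])
```

Flattening first means a single reduction covers every layer, instead of one communication call per parameter tensor.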

Running on Google Colab

  • Google Colab is missing some dependencies in the base Python environment it provides. The following steps helped me enable OpenAI Gym support on Google Colab.

    • !apt-get install -y libgl1-mesa-dev libgl1-mesa-glx libglew-dev libosmesa6-dev software-properties-common
      !apt-get install -y patchelf

      These commands install the libraries needed to run the MuJoCo environment, such as libglew and mesa.

    • !pip install gym: This command installs the OpenAI Gym package into the base environment.

    • !pip install free-mujoco-py: This command installs mujoco-py, which enables you to run MuJoCo with OpenAI Gym. This is different from a local MuJoCo installation and does not come with opencl support.

    • !pip install mpi4py: This command enables MPI support for Python. This is important if you want to work on a complex environment like MuJoCo. It helps you run different environments on different CPUs, after which you can gather results or broadcast parameters across CPUs.

    • !mpirun --allow-run-as-root -np 8 python3 main.py --parameter1=value --parameter2=value...: The "--allow-run-as-root" flag is not recommended for Google Colab, but I found that I couldn't run my program with MPI without it.

Setup difficulties with M1 Mac

  • There were no conda packages available for mujoco-py, so I had to install MuJoCo from pip. This required setting a few things up before installing MuJoCo. The install script is in the install-mujoco_dummy.sh file. Replace the MuJoCo version with the appropriate version for your install.
  • After creating the install script, you need to set CC, CXX, LDFLAGS and CXXFLAGS to ensure MuJoCo is built with clang instead of gcc. The paths would be in the llvm folder inside opt/homebrew/opt.
  • You might have to downgrade the MuJoCo version in order for gym to use MuJoCo. This could be due to dependency issues with the C++ source files (dylib).

How to run

  • Use the following command to train the model
        mpirun -np 8 python3 main.py --per=True
    
  • Use the following command to test the model
        python3 main.py --mode=test --per=True