This project uses policy gradient methods such as PPO or TRPO along with Generative Adversarial Networks to achieve Imitation Learning on discrete gym environments.
The methodology used here is explained in the Generative Adversarial Imitation Learning (GAIL) [paper].
Given an Expert Policy as input, the GAIL algorithm uses a Policy Gradient method (PPO in this case) to achieve Imitation Learning, and in most cases the learned policy ends up performing better than the input Expert Policy.
For more information on why we chose this methodology over other algorithms, read the report: GAIL to solve Discrete environments
- Run PPO algorithm - run the PPO algorithm on an environment
  - Create the Actor-Critic architecture that represents the two policy networks
  - Code the PPO algorithm
  - Train an agent using the PPO algorithm
- Sample trajectories - sample trajectories that represent the Expert Policy, which we later use to train our agent for Imitation Learning
  - Restore the agent policy network weights
  - Sample states and actions using the Expert Policy
  - Save the sampled states and actions into csv files
- Test Expert Policy - test the learned Expert Policy to see if it satisfies the criteria for solving the environment (render the runs if you want)
- Train agent using GAIL for Imitation Learning - given the expert trajectories as input, we use Generative Adversarial Imitation Learning to train the agent
  - Create a Discriminator that differentiates between the Expert Policy and the Generated Policy (same as in a conventional Generative Adversarial Network)
  - Train the agent to learn by imitating the given Expert Policy (uses the GAIL algorithm)
- Run Baseline implementations of PPO and TRPO to compare performance with our implementations
- Observe reward plots on TensorBoard - TensorBoard contains the following plots:
  - Our PPO implementation's reward and length plots
  - Expert Policy testing plot
  - GAIL reward and length plots (final agent)
  - Baseline reward, length and loss plots for comparison
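
The Discriminator step above can be sketched in plain NumPy, independent of the notebook's TensorFlow code. This is only a minimal sketch under stated assumptions: the discriminator is a logistic model over state-action features, expert pairs are labelled 1 and generated pairs 0, and the agent's surrogate reward is `log D(s, a)` (some implementations use the opposite labelling or `-log(1 - D)`). The names `pair_features`, `discriminator_prob`, `train_discriminator` and `gail_reward`, as well as the toy data, are hypothetical and are not taken from GAIL.ipynb.

```python
import numpy as np

def pair_features(states, actions, n_actions):
    """Encode (state, discrete action) pairs: the state is copied into its action's block."""
    states = np.asarray(states, dtype=np.float64)
    onehot = np.eye(n_actions)[np.asarray(actions, dtype=np.int64)]
    return (onehot[:, :, None] * states[:, None, :]).reshape(len(states), -1)

def discriminator_prob(w, b, x):
    """Probability that each feature row came from the expert."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def train_discriminator(expert_x, agent_x, lr=0.5, steps=500):
    """Fit the logistic discriminator by gradient descent on binary cross-entropy."""
    x = np.vstack([expert_x, agent_x])
    y = np.concatenate([np.ones(len(expert_x)), np.zeros(len(agent_x))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        grad = discriminator_prob(w, b, x) - y  # gradient of the loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def gail_reward(w, b, x, eps=1e-8):
    """Surrogate reward passed to the policy-gradient step: log D(s, a)."""
    return np.log(discriminator_prob(w, b, x) + eps)

# Toy data (assumption): the "expert" picks action 1 when the first state
# component is positive, while the untrained agent acts at random.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 4))
expert_actions = (states[:, 0] > 0).astype(int)
agent_actions = rng.integers(0, 2, size=200)

expert_x = pair_features(states, expert_actions, n_actions=2)
agent_x = pair_features(states, agent_actions, n_actions=2)
w, b = train_discriminator(expert_x, agent_x)

expert_reward = gail_reward(w, b, expert_x).mean()
agent_reward = gail_reward(w, b, agent_x).mean()
```

In a full GAIL loop the discriminator and the policy are updated alternately, so the reward signal keeps shifting as the agent's behaviour approaches the expert's.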
Note - We can use any algorithm to obtain the Expert Policy for GAIL agent training. We can also use other policy gradient methods, such as TRPO, in place of PPO in the GAIL algorithm to obtain our imitating agent. However, the performance may vary depending on the algorithm chosen.
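
To make concrete what swapping the policy gradient method involves: the PPO-specific part of the update is its clipped surrogate objective, whereas TRPO would replace it with a KL-constrained step. The following is a minimal NumPy sketch of that objective, assuming per-sample log-probabilities and advantage estimates are already available. The function name `ppo_clipped_objective` and the default `clip_eps=0.2` are illustrative choices, not values taken from the notebook.

```python
import numpy as np

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised) for one batch."""
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))
    advantages = np.asarray(advantages, dtype=np.float64)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The element-wise minimum removes the incentive to move the ratio outside the clip range.
    return np.minimum(unclipped, clipped).mean()

# Three toy samples: the new policy raised, kept and lowered the action probability.
old_lp = np.log(np.array([0.5, 0.5, 0.5]))
new_lp = np.log(np.array([0.9, 0.5, 0.1]))
adv = np.array([1.0, 1.0, -1.0])
objective = ppo_clipped_objective(new_lp, old_lp, adv)
```

When PPO is used inside GAIL, the advantages are estimated from the discriminator-derived reward rather than from the environment reward.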
- TensorFlow (faster if you have GPU support enabled)
- OpenAI Gym
- numpy
- Run Jupyter notebook or Jupyter lab
- Open GAIL.ipynb file
- Follow the instructions in the notebook to run the project
- Follow the instructions in the notebook to generate and observe the plots on Tensorboard
- Generative Adversarial Imitation Learning [paper]
- OpenAI baselines GAIL
- Tensorflow implementation of Generative Adversarial Imitation Learning(GAIL) with discrete action
- Simple GAIL implementation using Tensorflow