Usage of genetic algorithms to train a neural network in the multiple OpenAI gym environments.

Random Carracing	GA Carracing

Random BipedalWalker-v2	GA BipedalWalker-v2	Small BipedalWalker-v2*

* -- see more here

Random Cartpole-v0	GA Cartpole-v0

GA SlimeVolley-v0

The Slime Volleyball environment turned out to be quite hard for proposed method. The interesting fact to observe is that trained agent (on the right) tries to imitate movement of its opponent.

Overview
Usage
Details
Extra experiments
- Model compression and knowledge distillation
- Multitask learning
Environments

Overview

The project aims to train neural networks using genetic algorithms. Instead of minimalizing the cost function using common optimizers such as: SGD or Adam the simple GA was used. The algorithm was trying to alter the weights and biases of the neural network to achieve the best score of the fitness function. The profound description of genetic algorithms as well as the used environments are shown below.

Usage

Preliminary

Create virtual environment (python3 -m venv name-of-venv && source name-of-venv/bin/activate) and install all necessary packages (pip install -r requirements.txt).

Test pretrained model

The pretrained models are located in models/{name_of_environment} directory. For example to check how the model performs in BipedalWalker environment, specify name of the pretrained model in tests/bipedalwalker/testing_model_bipedalwalker.py script (also one have to adjust the model architecture) and run that script.

Train neural models

Due to my limited computing resources, in training neural models I have used Spell platform (I really recommend it for smaller project)

spell run "python scripts/spell/bipedalwalker_mlp_spell.py" --pip-req requirements.txt

For Carracing environment (need to open virtual display server)

spell run --apt python-dev --apt cmake --apt zlib1g-dev --apt libjpeg-dev \
	--apt xvfb --apt ffmpeg --apt xorg-dev --apt python-opengl --apt libboost-all-dev \
	--apt libsdl2-dev --apt swig \ 
	"Xvfb :1 -screen 0 1024x768x24 -ac +extension GLX +render -noreset &> xvfb.log ; export DISPLAY=:1 ; python scripts/spell/carracing_conv_spell.py" \ 
	--pip-req requirements.txt

Details

The question arises, since the main objective of neural network is to minimalize the cost function, why can't we use the genetic algorithms instead of SGD or Adam optimizer? The approach, used in this project, assumes application of simple GA in order to train the neural networks in the multiple OpenAI Gym environments.

From neural network to genotype

The figure on the left hand-side depicts the steps in order to turn the weights and biases of the neural network into the genotype. Firstly, the matrices are flattened and the biases vectors are concatenated to them. After one episode of the genetic algorithm, the genotype (vector) is converted into the input neural network architecture (the figure on the right hand-side).

Genetic Algorithm

The vast number of genetic algorithms are constructed using 3 major operations: selection, crossover and mutation. In those experiments I checked many different types of the mentioned algorithms.

Setup -- As in many various generic algorithms at the beginning the population of individuals (neural networks) are created. The weights and biases of them are initialized randomly.
Selection -- For each individual the fitness function are calculated. I tested both the ranking and the roulette wheel selection and the former method worked significantly better. As the result, the two individuals are selected with the highest fitness score.
Crossover -- The 'parents' (two neural networks) are decomposed into the flat vectors and then the simple or the BLX-alpha crossover is performed.
Mutation -- Each made child has a chance to mutate i.e., alter the weights or biases. I found out that bigger mutation (uniform distribution with large range) is vital to achieve greater results.
Repeat -- If the sum of fitness score of the created children is greater than theirs parents, the children go to the next generation. The process starts from the beginning (step 2.) until the number of the generations reached the limit or the fitness score was satisfied.

Fitness function

The calculation of the fitness function varies, it depends on the environment. However, the basic example to compute is as follows:

env = gym.make("your environment")

def compute_fitness_function(env, model, n_episodes: int):
  obs = env.reset()
  fitness = 0
  for n in n_episodes:
    action = model(obs)
    obs, reward, done, info = env.step(action)
    fitness += reward
    if done:
      break
  return fitness


Mean value of the fitness function for BipedalWalker-v2 problem

The fitness function in this example is non monotonic and the variance for the generation above 1000 is sizeable. These situations were common during training.

Extra experiments

Model compression and knowledge distillation

Pruning trained neural network

Can we reduce the sizes of the neural network while keeping good performance in the particular environment? The first step was to gather the training data for the smaller (student) model i.e. I took the best model (from BipedalWalker environment) and ran it multiple times whilst saving both the features (observations from environment) and the labels (models' actions). The initial neural model architecture was shrinkage significantly, from 24-20-12-12-4 to 4-4-4, interestingly, preserving the ability to complete the task.

Multitask learning


Example scheme for training the agent in two environments	Example model architecture for multitask learning

I used Hard parameters sharing approach to train both the CartPole and BipedalWalker agents. The figures above depict how the neural model architecture looks like.

I also checked how the trained model will behave, if I fed it with different input values, that is:

Both agents receive correct observations from environments
CartPole agent receives random noise from uniform distribution, while BipedalWalker agent receives correct observations
CartPole agent receives correct observations, while BipedalWalker agent receives random noise from uniform distribution
Both agents receive random noise from uniform distribution.

The results are shown below. To conclude, the both model inputs are important, but not equally. It is clearly visible that for 2. option, despite noised observations in CartPole, the BipedalWalker agent performs reasonable well.


1. BipedalWalker	1. CartPole

2. BipedalWalker	2. CartPole

3. BipedalWalker	3.CartPole

4. BipedalWalker	4. CartPole

Environments

CarRacing-v0	Architecture
Observation space: - RGB 96x96 pixels frame (with all the control bar on the bottom of the screen)
Actions(continues): - sterring angle (s = [-1, 1]) - throttle (t = [0, 1]) - brake (b = [0, 1])
Reward - (-0.1) every frame and +1000/N for every track tile visited, where N is the total number of tiles in track
Genetic algorithm parameters - population size: 50 - generation: 2000 - mutation rate: 0.4 - crossover rate: 0.8

BipedalWalker-v2	Architecture of neural network
Environment (continuous) - 24 observations (hull_angle, vel_x, vel_y and many more)
Actions (continues): - Hip_1 and Hip_2 (Torque / Velocity) - Knee_1 and Knee_2 (Torque / Velocity)
A reward is given for moving forward, total of 300+ points up to the far end. If the robot falls, it gets -100
Episode Termination The episode ends when the robot body touches ground or the robot reaches the far right side of the environment
Genetic algorithm parameters - population size: 50 - generation: 2000 - mutation rate: 0.3 - crossover rate: 0.9

Cartpole-v0	Architecture of the NN
Environment (continuous) - 4 observations (Cart Position and Velocity, Pole Angle and Velocity at Tip)
Actions (discrete): - Push a cart to the left or right
Reward - 1 for every step taken, including the termination step
Episode Termination - Pole Angle is more than ±12° - Cart Position is more than ±2.4 (center of the cart reaches the edge of the display) - Episode length is greater than 200
Genetic algorithm parameters - population size: 100 - generation: 20 - mutation rate: 0.4 - crossover rate: 0.9

SlimeVolley-v0	Architecture of the neural network
Environment (continuous) - 12-dim vector with position and velocity of each agents and the ball
Actions (discrete): - 3-dim vector (go left, right or jump)
Reward For that task I designed custom fitness function: $R = \sum_{i=0}^{elapsed\_episodes} \left( reward_i + \gamma \cdot agent\_lives + \beta \cdot (total\_lives - enemy\_lives) \right) + \alpha \cdot elapsed\_episodes$
Episode Termination The episode ends when either the agent losses all five lives, or after 3000 timesteps have passed
Genetic algorithm parameters - Population: 30 - Number of generations: 2000 (the best model at iteration 1824) - Mutation rate: 0.1 - Crossover rate: 0.8

robertjankowski/ga-openai-gym