This repository contains my PhD thesis in PDF format.
Diversity Based Exploration for Deep Reinforcement Learning.
This thesis is concerned with the family of methods at the intersection of deep reinforcement learning and evolutionary algorithms, applied to problems where an algorithm's ability to explore, and in particular its handling of the exploration-exploitation dilemma, plays a crucial role. Deep reinforcement learning algorithms are generally recognized for their sample efficiency, but they have a limited ability to explore the space of possible solutions because their exploration mechanisms are simple, often amounting to adding stochastic noise in the action space. Evolutionary methods, and more particularly recent algorithms from the quality-diversity family, explore far better thanks to mechanisms that maintain a diversity of solutions within a population. They can solve optimization problems that require exploring the solution space intelligently (through a subspace of interest called the behavior space), but they are often costly in terms of interactions with the environment.
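For illustration only, here is a minimal, self-contained sketch (not code from the thesis) contrasting the two exploration mechanisms: Gaussian noise in the action space, as is common in deep reinforcement learning, versus a toy MAP-Elites-style archive indexed by a behavior descriptor. The evaluate function, the grid resolution, and the mutation scale are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Noise-based exploration (common in deep RL) ---
# A deterministic policy explores by perturbing its actions with Gaussian noise.
def noisy_action(policy_action, sigma=0.1):
    return policy_action + rng.normal(0.0, sigma, size=policy_action.shape)

# --- Diversity-based exploration (MAP-Elites-style quality-diversity) ---
# Solutions are stored in an archive indexed by a discretized behavior
# descriptor; a newcomer replaces the incumbent of its cell only if it is fitter.
def evaluate(solution):
    # Toy stand-in for a rollout: returns a fitness and a 2D behavior in [0, 1]^2.
    fitness = -float(np.sum(solution ** 2))
    behavior = (np.tanh(solution[:2]) + 1.0) / 2.0
    return fitness, behavior

def to_cell(behavior, bins=10):
    return tuple(np.clip((behavior * bins).astype(int), 0, bins - 1))

archive = {}  # cell -> (fitness, solution)
for _ in range(2000):
    if archive:
        keys = list(archive)
        parent = archive[keys[rng.integers(len(keys))]][1]
        child = parent + rng.normal(0.0, 0.2, size=parent.shape)  # mutation
    else:
        child = rng.normal(0.0, 1.0, size=4)  # random initial solution
    fitness, behavior = evaluate(child)
    cell = to_cell(behavior)
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, child)

print(f"cells filled: {len(archive)} / 100")
```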
The first part of the contributions of this thesis (Chapter 3) develops a hybrid algorithm, called Quality-Diversity-Policy-Gradient (QD-PG), for solving hard-exploration problems in continuous control environments (simulated robotics). It builds on the algorithmic framework of quality-diversity methods and aims to make it sample efficient by leveraging gradient-based updates derived from reinforcement learning.

In the second part (Chapter 4), we present:
1. A new quality-diversity algorithm, called MAP-Elites Low-Spread, which corrects the variance bias of the MAP-Elites algorithm and produces solutions whose behavior is consistent and reliable in the behavior space.
2. A supervised deep learning method for distilling a collection of solutions generated by MAP-Elites Low-Spread into a single deep neural network based on the Transformer architecture, capable of generating trajectories conditioned on a desired behavior with high accuracy (a minimal illustration of this distillation step is sketched below).

Finally, the last contribution (Chapter 5) introduces a work in progress in which we propose a Transformer-based model that predicts the final state of continuous cellular automata from an initial state, without prior knowledge of their governing rules. We hypothesize that such a model could be used to automatically detect interesting patterns generated by a search algorithm.
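As a rough sketch of the distillation idea of Chapter 4 (again, not code from the thesis), the snippet below performs behavior-conditioned behavior cloning on data assumed to come from a quality-diversity archive. A small MLP stands in for the Transformer used in the thesis, and all names, shapes, and dimensions are illustrative assumptions.

```python
import torch
from torch import nn

# Hypothetical dataset distilled from a QD archive: for each elite we assume we
# stored the behavior descriptor it achieved and the (observation, action)
# pairs of its trajectory. Random tensors stand in for that data here.
num_pairs, obs_dim, bd_dim, act_dim = 4096, 8, 2, 3
observations = torch.randn(num_pairs, obs_dim)
descriptors = torch.rand(num_pairs, bd_dim)   # desired behavior, in [0, 1]^2
actions = torch.randn(num_pairs, act_dim)     # actions taken by the elites

# A small MLP stands in for the Transformer: it maps an observation plus a
# desired behavior descriptor to an action.
policy = nn.Sequential(
    nn.Linear(obs_dim + bd_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behavior cloning: regress the elites' actions, conditioned on the descriptor.
for epoch in range(10):
    inputs = torch.cat([observations, descriptors], dim=-1)
    loss = nn.functional.mse_loss(policy(inputs), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At deployment, the single distilled network can be queried with any desired
# behavior descriptor.
with torch.no_grad():
    target_descriptor = torch.tensor([[0.2, 0.8]])
    action = policy(torch.cat([observations[:1], target_descriptor], dim=-1))
```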