The design of a control system for an agile mobile robot in the continuous domain is a central question in robotics. This project specifically addresses the challenge of autonomous drone flight. Model-free reinforcement learning (RL) is used because it can directly optimize a task-level objective and leverage domain randomization to handle model uncertainty, enabling the discovery of more robust control responses. The task analyzed in the following is a single-agent stabilization task.
The gym-pybullet-drones environment is based on the Crazyflie 2.x nanoquadcopter. It implements the OpenAI Gym API for single- and multi-agent reinforcement learning (MARL).
Fig. 1: The three types of gym-pybullet-drones models, as well as the forces and torques acting on each vehicle.
The following shows a training result in which the agent has learned to control the four independent rotors to overcome the physical forces (e.g. gravity) simulated by the Bullet physics engine, stabilize, and settle into steady flight.
In this project, training uses a policy gradient method: a custom implementation of Proximal Policy Optimization (PPO).
The architecture consists of two separate neural networks: the actor network and the critic network. The actor network is responsible for selecting actions given the current state of the environment, while the critic network is responsible for evaluating the value of the current state.
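Below is a minimal PyTorch sketch of this two-network setup. The layer widths, activations, and the Gaussian policy head are assumptions for illustration; the README does not specify the project's actual architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a diagonal Gaussian distribution over actions."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),          # mean of the action distribution
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs):
        mean = self.net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    """Maps a state to a scalar value estimate."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)
```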
The actor network takes the current state $s_t$ as input and outputs the parameters of a probability distribution over actions. It is trained with the clipped surrogate objective

$$ L^{actor}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right] $$

where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio between the new and old policy, $\hat{A}_t$ is the advantage estimate at time step $t$, and $\epsilon$ is the clipping parameter.
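As a sketch, the clipped surrogate can be computed from a batch of transitions roughly as follows; the names `dist`, `actions`, `old_logp`, and `adv` are placeholders rather than the project's actual API.

```python
import torch

def actor_loss(dist, actions, old_logp, adv, eps=0.2):
    # Joint log-probability over the action dimensions of a diagonal Gaussian.
    logp = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(logp - old_logp)                 # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Negate: optimizers minimize, but the surrogate is maximized.
    return -torch.min(ratio * adv, clipped * adv).mean()
```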
The critic network takes the current state $s_t$ as input and outputs an estimate of its value. It is trained by minimizing the mean squared error between the predicted value and the observed return:

$$ L^{critic}(\theta) = \mathbb{E}_t \left[ \left( V_{\theta}(s_t) - R_t \right)^2 \right] $$

where $V_{\theta}(s_t)$ is the critic's value estimate for state $s_t$ and $R_t$ is the discounted return from time step $t$.
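The corresponding value loss is a plain mean squared error; a sketch under the same assumptions:

```python
import torch.nn.functional as F

def critic_loss(critic, obs, returns):
    # Regress the critic's value estimates toward the observed returns R_t.
    return F.mse_loss(critic(obs), returns)
```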
The observation space is defined by the quadrotor state, which includes the position, linear velocity, angular velocity, and orientation of the drone. The action space consists of the desired thrust in the z direction and the desired torques about the x, y, and z axes.
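For illustration, these spaces could be declared with Gym's `Box` type as follows; the bounds and the quaternion encoding of the orientation are assumptions, not values taken from the project.

```python
import numpy as np
from gym import spaces

# Observation: position (3) + linear velocity (3) + angular velocity (3)
# + orientation as a quaternion (4) = 13 dimensions (encoding assumed).
obs_space = spaces.Box(low=-np.inf, high=np.inf, shape=(13,), dtype=np.float32)

# Action: thrust along z (1) + torques about x, y, z (3), normalized to [-1, 1].
act_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
```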
The reward function defines the problem specification: it rewards the agent for reaching and maintaining stable flight.
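Since the README does not give the exact formula, the following is purely an illustrative shaping for a stabilization task: penalize the distance to a hover point and excessive body rates.

```python
import numpy as np

def reward(pos, target_pos, ang_vel):
    # Illustrative only: coefficients and terms are arbitrary choices.
    return -np.linalg.norm(target_pos - pos) ** 2 \
           - 0.01 * np.linalg.norm(ang_vel) ** 2
```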
The environment is a custom OpenAI Gym environment built using PyBullet for multi-agent reinforcement learning with quadrotors. The training goal is to stabilize drone flight.
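A generic Gym interaction loop with such an environment might look as follows; `StabilizeAviary` is a placeholder name for the custom environment, not a class provided by gym-pybullet-drones.

```python
env = StabilizeAviary()                   # placeholder for the custom env
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()    # replace with the trained actor's output
    obs, rew, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```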
The project was developed using Python and the PyTorch machine learning framework. To simulate the quadrotor's environment, the Bullet physics engine is leveraged. Further, to streamline the development process and avoid potential issues, the pre-built PyBullet drone implementation provided by the gym-pybullet-drones library is utilized.
To get a local copy up and running, follow these steps.

This repository was written using Python 3.10 and Anaconda, and tested on macOS 14.4.1. Major dependencies are gym, pybullet, stable-baselines3, and rllib.
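With Anaconda, the virtual environment for the first step below can be created like this (the environment name is illustrative):

```sh
$ conda create -n explorer-drone python=3.10
$ conda activate explorer-drone
```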
- Create a virtual environment and install the major dependencies

  ```sh
  $ pip3 install --upgrade numpy matplotlib Pillow cycler
  $ pip3 install --upgrade gym pybullet stable_baselines3 'ray[rllib]'
  ```

  or use the requirements file:

  ```sh
  $ pip install -r requirements_pybullet.txt
  ```
- Video recording requires ffmpeg to be installed. On macOS:

  ```sh
  $ brew install ffmpeg
  ```

  or on Ubuntu:

  ```sh
  $ sudo apt install ffmpeg
  ```
- The gym-pybullet-drones repo is structured as a Gym environment and can be installed with pip install --editable:

  ```sh
  $ cd gym-pybullet-drones/
  $ pip3 install -e .
  ```
- Add Changelog
- Add back to top links
- Fix the sparse reward issue by adding proximity rewards
- Adjust the reward function so the agent approaches a target
- Implement in Unity with ML-Agents
- Update the README
Distributed under the MIT License. See LICENSE.txt for more information.
Project Link: Autonomous-Explorer-Drone