Implementation of reinforcement learning algorithm for the sensory exploration of the rooms.
Simulation environment: multi-room simulator.
For the solution classical PPO algorithm from ray library was used. The reward function for the agent is defined as following:
reward = explored * (1 / max(np.log(step), 1))
Where:
- eplored: int # number of new cells explored on the current step
- step: int # the step number
The idea behind the reward function is a simple time-discounting of new cells, so in order to maximize the reward, the agent should maximize the number of opened cells on early steps.
- Clone the repository and install all the packages with
pip install -r requirements.txt
. - Login to
wandb
from cli and modify the parameters in theconfig.json
. - Run the script train.py.
Note: The script takes one optional argument – path to the config file, by deafult config from the repository is used. To run the script with custom config:
python3 -m train -i <path_to_your_config>
As for the experiment, the agent was trained with the following parameters:
Parameter | Value |
---|---|
Number of itterations | 500 |
Entropy coefficient | 0.1 |
Lamda | 1 |
Seed | 42 |
Max steps | 2000 |
Width & Height of a room | 20 |
Observation size | 11 |
Vision radius | 5 |
Activation function | relu |
Cell types | UNK, FREE, OCCUPIED |
Convolution layers | [[16, [3, 3], 2], [32, [3, 3], 2], [32, [3, 3], 1]] |
Wanb report with the mean_reward
and mean_episode_len
charts can be found via the link here.
Gif example of the agent moves after 500 iteranion:
As the charts of the mean reward and the episode length show, the agent learns fast during the first 13 iterations, but stucks after reaching the result of ~1200 steps and doesn't aim to finish the exploration faster as time passes by. The problem might be, that the current reward function is to abstact. On the gif above it's clearly seen that agent stucks at the same cells and repeats the same moves in a loop. It might be a good idea, to take a trajectory into account, and stimulate agent not to repeat the same moves.