A car agent learns to navigate complex traffic conditions using PPO.
The agent is supposed to learn to choose accelerations that bring it to its destination without causing traffic jams or collisions.
While navigating on the road, the car agent may encounter other cars.
In some situations, the acceleration chosen by the agent will cause a jam or a collision.
Because the traffic conditions can become very complex and the GAMA simulator has no built-in collision handling, I have to implement collision detection and jam detection myself.
The detection considers the 10 cars closest to the agent and calculates their distances.
These distance calculations are necessary for safe driving; Euclidean distance is used here.
First, the agent computes the relevant distances (the distance to the car behind and the distance to the car in front).
Then, after the agent chooses an acceleration, the detection checks whether that acceleration would cause a jam or a collision.
The unit of time is one simulation cycle.
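As a rough sketch of this step, the following ranks nearby cars by Euclidean distance and keeps the 10 closest. The car representation (a dict with a `pos` tuple) is hypothetical; the actual positions come from the GAMA model:

```python
import math

def closest_cars(agent_pos, other_cars, k=10):
    """Return the k cars closest to the agent, with their Euclidean distances.

    agent_pos:  (x, y) position of the agent.
    other_cars: list of dicts with at least a 'pos' key -- an assumed
                representation; the GAMA model stores this differently.
    """
    def dist(car):
        dx = car["pos"][0] - agent_pos[0]
        dy = car["pos"][1] - agent_pos[1]
        return math.hypot(dx, dy)

    ranked = sorted(other_cars, key=dist)
    return [(car, dist(car)) for car in ranked[:k]]
```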
If another car is in front of the agent on the same road, the chosen acceleration is checked for whether it would cause a collision with the front cars (there may be more than one front car).
If another car is behind the agent on the same road, the chosen acceleration is checked for whether it would cause a jam with the cars behind (there may be more than one).
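The per-cycle check could look like the sketch below. It assumes the other cars hold constant speed over one cycle and uses an assumed `safe_gap` spacing; the source does not give the exact thresholds:

```python
def causes_collision_or_jam(gap_front, gap_behind, v_agent, v_front, v_behind,
                            accel, dt=1.0, safe_gap=2.0):
    """Predict, one cycle ahead, whether a chosen acceleration is unsafe.

    A hypothetical sketch: other cars are assumed to keep constant speed
    over the cycle, and safe_gap is an assumed minimum spacing.
    gap_front / gap_behind may be None when no such car exists.
    """
    # Distance the agent travels in one cycle with the chosen acceleration.
    d_agent = v_agent * dt + 0.5 * accel * dt * dt

    collision = False
    if gap_front is not None:
        d_front = v_front * dt
        # Collision risk: the gap to the front car shrinks below safe_gap.
        collision = gap_front + d_front - d_agent < safe_gap

    jam = False
    if gap_behind is not None:
        d_behind = v_behind * dt
        # Jam risk: the car behind closes to within safe_gap of the agent.
        jam = gap_behind + d_agent - d_behind < safe_gap

    return collision, jam
```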
When the cars are on different roads, the calculation process is the same, but the conditions become much more complex.
Are the 10 closest cars on the same road as the agent?
If so, is each of them in front of or behind the agent?
These conditions are handled explicitly in the gaml file.
State representation: [real_speed/10, target_speed/10, elapsed_time_ratio, distance_to_goal/100, distance_front_car/10, distance_behind_car/10]
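A minimal sketch of assembling this state vector, using the scaling constants listed above (the field names follow the GAMA message):

```python
def build_state(real_speed, target_speed, elapsed_time_ratio,
                distance_to_goal, distance_front_car, distance_behind_car):
    """Assemble the normalized state vector fed to the network.

    The divisors (10 and 100) are the scaling constants from the state
    definition above.
    """
    return [
        real_speed / 10.0,
        target_speed / 10.0,
        elapsed_time_ratio,
        distance_to_goal / 100.0,
        distance_front_car / 10.0,
        distance_behind_car / 10.0,
    ]
```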
The network's output is an acceleration, constrained to [-5, 8] m/s^2 to stay closer to realistic driving.
Output: acceleration.
Action representation: [acceleration].
The car learns to control its acceleration subject to the restrictions shown below:
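The document does not say how the raw network output is mapped into [-5, 8] m/s^2; one common choice, shown here purely as an assumption, is tanh squashing followed by linear rescaling:

```python
import math

A_MIN, A_MAX = -5.0, 8.0  # acceleration bounds in m/s^2

def scale_action(raw):
    """Map a raw network output to the allowed acceleration range.

    Assumed mechanism: squash with tanh into (-1, 1), then rescale
    linearly to [A_MIN, A_MAX]. The source does not specify this step.
    """
    squashed = math.tanh(raw)
    return A_MIN + (squashed + 1.0) * 0.5 * (A_MAX - A_MIN)
```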
Reward shaping:
- r_t = r_terminal + r_danger + r_speed
- r_terminal: given on a crash or when the time expires; -0.013 if target_speed > real_speed, or -0.1 if target_speed < real_speed
- r_speed: depends on the instantaneous speed s_a relative to the target speed s_t:
  - if s_a <= s_t: 0.001 - 0.004 * ((target_speed - Instantaneous_speed) / target_speed)
  - if distance_front_car_before <= safe_interval or time_after_safe_interval > 0: 0.001 * (Instantaneous_speed / target_speed); time_after_safe_interval is extended while a front car stays within safe_interval
  - if s_a > s_t: 0.001 - 0.006 * ((Instantaneous_speed - target_speed) / target_speed)
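The r_speed rules above can be sketched as a single function; the branch ordering (the safe-interval case checked first) is my assumption, since the source does not state how the cases interact:

```python
def speed_reward(inst_speed, target_speed,
                 distance_front_car_before, safe_interval,
                 time_after_safe_interval):
    """Sketch of the r_speed term, following the rules listed above.

    Assumption: the safe-interval case takes priority over the
    below/above-target-speed cases.
    """
    # A front car is (or recently was) within the safe interval.
    if distance_front_car_before <= safe_interval or time_after_safe_interval > 0:
        return 0.001 * (inst_speed / target_speed)
    # Below or at target speed: penalize the shortfall.
    if inst_speed <= target_speed:
        return 0.001 - 0.004 * ((target_speed - inst_speed) / target_speed)
    # Above target speed: penalize the excess more strongly.
    return 0.001 - 0.006 * ((inst_speed - target_speed) / target_speed)
```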
In my experiment, I clearly want the agent to learn to keep its speed around the target speed.
The model with an LSTM trains noticeably better than models without one.
GAMA is a platform for running simulations.
I have a GAMA model named "PPO_Mixedinput_Navigation.gaml", which contains a car and some traffic lights. The model sends the data
[real_speed, target_speed, elapsed_time_ratio, distance_to_goal, reward, done, time_pass, over]
as a matrix to the Python environment, which computes the car's acceleration with A2C. Following the Markov Decision Process framework, the car in GAMA applies that acceleration and sends the latest data back to Python, over and over, until it reaches the destination.
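The GAMA-to-Python interaction loop described here can be sketched as follows; `env` and `agent` are hypothetical wrappers around the actual socket communication and the policy, introduced only for illustration:

```python
def run_episode(env, agent):
    """Sketch of the GAMA <-> Python loop described above.

    env.receive() is assumed to return the 8-element matrix sent by the
    GAMA model, env.send() to deliver the acceleration back, and
    agent.act() to map a state to an acceleration.
    """
    while True:
        (real_speed, target_speed, elapsed_time_ratio,
         distance_to_goal, reward, done, time_pass, over) = env.receive()
        if over:  # destination reached or episode terminated
            break
        state = [real_speed, target_speed, elapsed_time_ratio, distance_to_goal]
        accel = agent.act(state)
        env.send(accel)
```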