oxwhirl/smac

Simplify dense reward calculation if reward_only_positive is true

douglasrizzo opened this issue · 1 comments

In this part of the dense reward calculation method, the damage and deaths of ally units are accumulated into variables delta_ally and delta_deathsand used to compose the reward later. Notice how dealth_deaths is only changed if self.reward_only_positive is false:

for al_id, al_unit in self.agents.items():
if not self.death_tracker_ally[al_id]:
# did not die so far
prev_health = (
self.previous_ally_units[al_id].health
+ self.previous_ally_units[al_id].shield
)
if al_unit.health == 0:
# just died
self.death_tracker_ally[al_id] = 1
if not self.reward_only_positive:
delta_deaths -= self.reward_death_value * neg_scale
delta_ally += prev_health * neg_scale
else:
# still alive
delta_ally += neg_scale * (
prev_health - al_unit.health - al_unit.shield
)

When the reward is calculated using the previous accumulated values, delta_ally is only used if self.reward_only_positive is false. The version of delta_deaths that is altered in the ally loop above is also only used if self.reward_only_positive is false.

if self.reward_only_positive:
reward = abs(delta_enemy + delta_deaths) # shield regeneration
else:
reward = delta_enemy + delta_deaths - delta_ally

This makes me conclude that we only need to process ally units in this method if self.reward_only_positive is false, otherwise we can ignore the first loop. I don't know how much this would affect performance (this is a method that runs on every game step, after all) but I could come up with this simplified version. I'd just like others to validate if what I said is true.

Actually, I did a quick test and got different results after changing the method, so my assumptions were wrong.