airsplay/R2R-EnvDrop

About make_action when feedback method is 'sample' or 'argmax'

YicongHong opened this issue · 4 comments

Hi there,

I have a question about how the agent takes actions when the feedback method is 'sample' or 'argmax'. In the function make_equiv_action, it seems that even if the agent previously decided to STOP, it can keep taking new actions and the trajectory will be updated accordingly. This is fine in training, since the reward and reward_mask are set to zero if ended is True. But in testing and evaluation, an agent's final viewpoint becomes wherever it happens to be when all agents have ended, rather than where it was when it first decided to STOP. This could make a big difference when measuring path length and success rate.
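For reference, the relevant part of make_equiv_action looks roughly like this (paraphrasing from memory, so variable names and details may differ from the actual source):

for i, idx in enumerate(perm_idx):
    action = a_t[i]
    if action != -1:    # -1 marks the <stop> action; ended[i] is not checked here
        select_candidate = perm_obs[i]['candidate'][action]
        # ... turn/step the simulator to the selected candidate ...
        # ... and append the new viewpoint to traj[i]['path'] ...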

Could you elaborate more on this please?

Thanks a lot!
Yicong

Thanks. The action will be ignored after the agent has ended:

if ended[i]:    # Just ignore this index
    a[i] = args.ignoreid

and will not be executed:
if next_id == (candidate_leng[i]-1) or next_id == args.ignoreid:    # The last action is <end>
    cpu_a_t[i] = -1    # Change the <end> and ignore action to -1
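For context, that check sits inside the conversion loop in rollout(), roughly like this (simplified sketch; the exact code may differ slightly):

cpu_a_t = a_t.cpu().numpy()    # a_t: the chosen action index for each agent in the batch
for i, next_id in enumerate(cpu_a_t):
    if next_id == (candidate_leng[i]-1) or next_id == args.ignoreid:    # the last candidate slot is <end>
        cpu_a_t[i] = -1    # map <end> and ignored actions to -1, which make_equiv_action skips
self.make_equiv_action(cpu_a_t, perm_obs, perm_idx, traj)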

Hi Hao, thanks for your prompt reply.

a[i] = args.ignoreid is only set when getting the teacher action, target = self._teacher_action(perm_obs, ended), and a_t = target only when the feedback method is 'teacher'. When the feedback method is 'argmax' or 'sample', a_t comes directly from argmax() or sample() over the logit.

In make_equiv_action(), the condition for taking an action and updating the trajectory is only if action != -1, rather than if action != -1 and not ended[i]. This means that even if the agent previously decided to STOP, it can keep taking new actions and the trajectory will be updated accordingly, doesn't it?
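To make that concrete, my reading of how a_t is chosen is roughly this (simplified sketch):

if self.feedback == 'teacher':
    a_t = target    # teacher action; the only branch that can contain args.ignoreid
elif self.feedback == 'argmax':
    _, a_t = logit.max(1)    # greedy choice from the policy logits
elif self.feedback == 'sample':
    probs = F.softmax(logit, dim=1)
    a_t = torch.distributions.Categorical(probs).sample()    # stochastic choice

So for 'argmax' and 'sample', nothing maps the action of an already-ended agent to args.ignoreid before the cpu_a_t conversion.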

A good point! It seems that I carelessly did not stop the agent from moving after it chose the [stop] action. I have just tested the model under both evaluation protocols (based on the model shared here). The success rates are almost the same, namely 51.98% vs 51.89%. Full results here:

Original (do not stop after [stop] action):

Env name: val_unseen, nav_error: 5.2584, oracle_error: 3.4992, steps: 24.0558, lengths: 9.3711, 
 success_rate: 0.5198, oracle_rate: 0.5862, spl: 0.4814
Env name: val_seen, nav_error: 4.0279, oracle_error: 2.5805, steps: 25.4976, lengths: 10.1421, 
 success_rate: 0.6317, oracle_rate: 0.7140, spl: 0.6051

New (stop after [stop] action):

Env name: val_unseen, nav_error: 5.2628, oracle_error: 3.5013, steps: 24.0366, lengths: 9.3662, 
 success_rate: 0.5189, oracle_rate: 0.5854, spl: 0.4809
Env name: val_seen, nav_error: 4.0255, oracle_error: 2.5805, steps: 25.4927, lengths: 10.1385, 
 success_rate: 0.6327, oracle_rate: 0.7140, spl: 0.6060

Although the results barely change, it's my bad to have this bug in my code and I am going to switch the code to the correct version. I was lucky in this task, where the lengths of different trajectories are almost the same, but I might not be that lucky next time TAT.

What I have done now is to change the original if statement to:

if next_id == (candidate_leng[i]-1) or next_id == args.ignoreid or ended[i]:
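In context, the conversion loop would then read roughly like this (simplified sketch):

cpu_a_t = a_t.cpu().numpy()
for i, next_id in enumerate(cpu_a_t):
    if next_id == (candidate_leng[i]-1) or next_id == args.ignoreid or ended[i]:
        cpu_a_t[i] = -1    # <end>, ignored, or already-ended agents are all mapped to -1 (no move)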

Does the above code look correct? If so, I would update the repo.

Thx!

Yeah, I think it looks good now. Thanks xD

Yicong