thu-ml/tianshou

Adjust locations of setting the policy in train/eval mode


Currently, tianshou sets the policy's train/eval mode in the trainer and in the test_episode function. The corresponding training attribute is then used to decide whether a stochastic policy should be evaluated deterministically (given that policy.deterministic_eval is True). This is a misuse: the training attribute primarily affects modules like dropout and batchnorm, so it should always be False during data collection and only be True inside policy.learn.
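
The problematic coupling looks roughly like the following. This is a minimal, self-contained sketch (not tianshou's actual code) of how deterministic_eval ends up depending on the torch training flag:

```python
import torch
from torch.distributions import Normal


class StochasticPolicy(torch.nn.Module):
    """Minimal sketch of the problematic coupling; not tianshou's actual code."""

    def __init__(self, deterministic_eval: bool = True):
        super().__init__()
        self.deterministic_eval = deterministic_eval
        self.mu = torch.nn.Linear(4, 2)
        self.log_sigma = torch.nn.Parameter(torch.zeros(2))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        dist = Normal(self.mu(obs), self.log_sigma.exp())
        # The misuse: the torch training flag (meant for dropout/batchnorm)
        # doubles as the switch between deterministic and sampled actions.
        if self.deterministic_eval and not self.training:
            return dist.mean
        return dist.rsample()


policy = StochasticPolicy()
obs = torch.randn(8, 4)

# The trainer / test_episode therefore have to toggle the torch mode
# just to change how actions are drawn during collection:
policy.train()  # stochastic actions for training rollouts
actions_train = policy(obs)
policy.eval()   # deterministic actions for evaluation rollouts
actions_eval = policy(obs)
```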

Max and I have implemented the following solution in #1123:

  • We introduced a new flag is_within_training_step, which is enabled by the training algorithm while within a training step, where a training step encompasses training data collection and policy updates. Algorithms now use this flag, rather than the (previously abused) torch training flag, to decide whether their deterministic_eval setting should apply.
  • The policy's train/eval mode (which should control torch-level behaviour such as dropout and batchnorm only) no longer needs to be set in user code in order to control collector behaviour (this never made sense!). The respective calls have been removed.
  • The policy should, in fact, always be in evaluation mode when collecting data, as there is never a reason to have gradient accumulation enabled during any type of rollout. We therefore explicitly set the policy to evaluation mode in Collector.collect; a sketch of the resulting control flow follows this list.
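
Under these points, the revised control flow might be sketched as below. The names is_within_training_step, deterministic_eval, and Collector.collect come from this issue; everything else is an illustrative assumption, not the actual code in #1123:

```python
import contextlib
import torch
from torch.distributions import Normal


class StochasticPolicy(torch.nn.Module):
    """Sketch only; details are assumptions, not the actual tianshou implementation."""

    def __init__(self, deterministic_eval: bool = True):
        super().__init__()
        self.deterministic_eval = deterministic_eval
        # New flag, toggled by the training algorithm around a training step;
        # fully decoupled from the torch train/eval mode.
        self.is_within_training_step = False
        self.mu = torch.nn.Linear(4, 2)
        self.log_sigma = torch.nn.Parameter(torch.zeros(2))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        dist = Normal(self.mu(obs), self.log_sigma.exp())
        # deterministic_eval now keys off the new flag, not self.training.
        if self.deterministic_eval and not self.is_within_training_step:
            return dist.mean
        return dist.rsample()


@contextlib.contextmanager
def within_training_step(policy: StochasticPolicy):
    """Enabled by the trainer around training data collection + policy updates."""
    policy.is_within_training_step = True
    try:
        yield
    finally:
        policy.is_within_training_step = False


def collect(policy: StochasticPolicy, obs: torch.Tensor) -> torch.Tensor:
    # Rollouts never need gradients or train-mode layers, so the collector
    # itself forces evaluation mode (analogous to what Collector.collect now does).
    policy.eval()
    with torch.no_grad():
        return policy(obs)
```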