DLR-RM/stable-baselines3

[Question] Shared feature extractor and gradient

brn-dev opened this issue · 2 comments

❓ Question

Hi, I have two questions:

Firstly, the comment here says the opposite of the code. I am assuming the code is the intended behavior, right?

# Learn the features extractor using the policy loss only
# when the features_extractor is shared with the actor
with th.set_grad_enabled(not self.share_features_extractor):
    features = self.extract_features(obs, self.features_extractor)
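
For reference, here is a minimal standalone sketch (toy modules, not SB3 code) of what that set_grad_enabled guard does in the shared case: with gradients disabled around the feature extraction, the critic loss cannot update the extractor, while the critic head itself is still trained.

import torch as th
import torch.nn as nn

extractor = nn.Linear(4, 8)  # stand-in for a shared features extractor
q_net = nn.Linear(8, 1)      # stand-in for a critic (Q-network) head
obs = th.randn(2, 4)

# Same pattern as the snippet above when share_features_extractor is True
with th.set_grad_enabled(False):
    features = extractor(obs)

q_loss = q_net(features).mean()
q_loss.backward()

print(extractor.weight.grad)  # None: the critic loss never reaches the extractor
print(q_net.weight.grad)      # a tensor: the critic head is still updated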

Secondly, does this only make sense in off-policy settings (when using Q-critics)? I checked out the ActorCriticPolicy, which is used in on-policy algorithms, and there both the actor and the critic loss influence the feature extractor's parameters (if shared):

features = self.extract_features(obs)
if self.share_features_extractor:
    latent_pi, latent_vf = self.mlp_extractor(features)
else:
    pi_features, vf_features = features
    latent_pi = self.mlp_extractor.forward_actor(pi_features)
    latent_vf = self.mlp_extractor.forward_critic(vf_features)

def extract_features(  # type: ignore[override]
    self, obs: PyTorchObs, features_extractor: Optional[BaseFeaturesExtractor] = None
) -> Union[th.Tensor, Tuple[th.Tensor, th.Tensor]]:
    """
    Preprocess the observation if needed and extract features.

    :param obs: Observation
    :param features_extractor: The features extractor to use. If None, then ``self.features_extractor`` is used.
    :return: The extracted features. If features extractor is not shared, returns a tuple with the
        features for the actor and the features for the critic.
    """
    if self.share_features_extractor:
        return super().extract_features(obs, self.features_extractor if features_extractor is None else features_extractor)

def forward(self, features: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
    """
    :return: latent_policy, latent_value of the specified network.
        If all layers are shared, then ``latent_policy == latent_value``
    """
    return self.forward_actor(features), self.forward_critic(features)

The features generated by the first feature extractor (self.features_extractor) are passed into the mlp_extractor, where they are used to form the actor and value latent features without any interruption of the gradient. So both the actor and the critic loss influence this feature extractor, whereas in the ContinuousCritic we prevent the feature extractor from being updated by the critic loss when it is shared with the actor.
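
To make that concrete, a toy sketch (made-up modules, not SB3 code) of the shared on-policy path: since nothing blocks the graph between the extractor and the two heads, a combined actor + critic loss produces gradients on the shared extractor.

import torch as th
import torch.nn as nn

extractor = nn.Linear(4, 8)    # stand-in for the shared features extractor
policy_head = nn.Linear(8, 2)  # stand-in for the actor latent / head
value_head = nn.Linear(8, 1)   # stand-in for the value latent / head
obs = th.randn(2, 4)

features = extractor(obs)  # no set_grad_enabled guard here
loss = policy_head(features).mean() + value_head(features).mean()  # actor + critic stand-in losses
loss.backward()

print(extractor.weight.grad is not None)  # True: both losses flow back into the shared extractor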

Firstly, the comment here says the opposite of the code.

I think the comment is correct, although a bit confusing.
When the feature extractor is shared, you don't want gradients to be backpropagated through the critic into it, hence set_grad_enabled(not self.share_features_extractor): the extractor is then trained by the policy loss only, as the comment says.

Secondly, does this only make sense in off-policy settings

This choice is based on experience. The recommendation is actually not to share the feature extractor when using off-policy algorithms. Another option would be to use the critic loss only to learn the features, but it is not recommended to use both losses when sharing the feature extractor (they can conflict).
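
If you want to set this explicitly (rather than relying on the current default), a minimal sketch assuming the usual policy_kwargs mechanism and the share_features_extractor argument of the SAC policy:

from stable_baselines3 import SAC

# Request separate feature extractors for actor and critic,
# passing the flag explicitly regardless of the default
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    policy_kwargs=dict(share_features_extractor=False),
)
model.learn(total_timesteps=1_000)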

ActorCriticPolicy which is used in on-policy algorithms

In the on-policy setting, where the actor and critic updates are done in one go, it seems (experimentally) that you can use both losses to update the features (and you can also use separate networks).
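
For completeness, separate networks can be requested the same way on the on-policy side, since ActorCriticPolicy exposes the same flag (again a sketch assuming the standard policy_kwargs mechanism):

from stable_baselines3 import PPO

# On-policy: sharing (the default) lets both losses train the extractor;
# setting the flag to False gives the actor and critic their own extractors
model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(share_features_extractor=False),
)
model.learn(total_timesteps=1_000)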

Makes sense, thanks!