Some questions about CQL
dbsxdbsx commented
First, thanks for implementing so many CQL variants. Some of the questions below concern your implementation, and some concern CQL itself.
1. Why does `_compute_policy_values` in CQL-SAC return `qs1 - log_pis.detach(), qs2 - log_pis.detach()` with `log_pis` detached? I think it should not be detached.
2. What do `self.temp` and `self.cql_weight` mean in CQL-SAC? I think `self.cql_weight` is redundant, since `cql_alpha` has a similar meaning.
3. Is it essential to use two Q-functions in CQL?
4. In CQL-SAC-Discrete, I think the `q1` inside `cql1_scaled_loss = torch.logsumexp(q1, dim=1).mean() - q1.mean()` should be an expectation over all possible Q(s, a) values rather than only the best one. Am I wrong?
5. In CQL-SAC, why is `retain_graph=True` used for the Lagrange and critic optimizers?
6. The most important question: according to p. 29 of the paper, for continuous actions the `logsumexp` objective is computed from Q-values of both uniformly sampled actions and actions sampled from pi. But why are actions from pi also used here? I also asked here, but I am still at a loss.
I know some of these CQL questions should be asked in the original repo, but its author is no longer active.
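To make the questions about `log_pis` and the `logsumexp` objective concrete, here is a minimal sketch of how I currently read the importance-sampled estimator for continuous actions. It is only my understanding and may be wrong. It is written in plain Python rather than PyTorch so that it runs standalone, the function names are hypothetical and not taken from the repo, and it assumes the same number of actions is drawn from the uniform proposal and from pi.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def importance_sampled_logsumexp(q_unif, log_p_unif, q_pi, log_p_pi):
    """Estimate log of the integral of exp(Q(s, a)) over actions a.

    q_unif, q_pi: Q-values at actions sampled from the uniform proposal / from pi.
    log_p_unif, log_p_pi: log-density of each sampled action under its own proposal.
    Assumes len(q_unif) == len(q_pi) (hypothetical helper, not the repo's code).
    """
    # Importance weighting in log space: exp(Q) / p(a) becomes Q - log p(a).
    weighted = [q - lp for q, lp in zip(q_unif, log_p_unif)]
    weighted += [q - lp for q, lp in zip(q_pi, log_p_pi)]
    # Average over all samples by subtracting log(total sample count).
    return logsumexp(weighted) - math.log(len(weighted))


# Sanity check: 1-D actions in [-1, 1], Q(s, a) = 1 everywhere, and a policy
# that happens to be uniform too, so every sample has density 0.5.
log_half = math.log(0.5)
estimate = importance_sampled_logsumexp(
    q_unif=[1.0, 1.0], log_p_unif=[log_half, log_half],
    q_pi=[1.0, 1.0], log_p_pi=[log_half, log_half],
)
# The exact value is log(2 * e) = 1 + log(2).
print(estimate)
```

If this reading is right, subtracting `log_pis` is just the importance weight for the actions sampled from pi, which is what makes me unsure about both the detach and the extra set of actions.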