RLego

A collection of building blocks for the Reinforcement Learning problem


From RLax:

Many functions consider policies, actions, rewards, and values at consecutive timesteps in order to compute their outputs. In this case the suffixes _t and _tm1 are often used to clarify on which step each input was generated (see the sketch after this list), e.g.:

    q_tm1: the action value in the source state of a transition.
    a_tm1: the action that was selected in the source state.
    r_t: the resulting rewards collected in the destination state.
    discount_t: the discount associated with a transition.
    q_t: the action values in the destination state.
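A minimal sketch of how these names line up in a one-step Q-learning TD error; the function name and signature are illustrative only, not RLego's actual API:

```python
import torch

def q_learning_td_error(q_tm1, a_tm1, r_t, discount_t, q_t):
    """One-step Q-learning TD error, using the naming convention above.

    q_tm1:      [A] action values in the source state (time t-1).
    a_tm1:      []  action selected in the source state.
    r_t:        []  reward collected in the destination state.
    discount_t: []  discount associated with the transition.
    q_t:        [A] action values in the destination state (time t).
    """
    target = r_t + discount_t * q_t.max()
    return target - q_tm1[a_tm1]

# Usage on a single transition with two actions.
q_tm1 = torch.tensor([1.0, 2.0])
a_tm1 = torch.tensor(0)
r_t = torch.tensor(0.5)
discount_t = torch.tensor(0.99)
q_t = torch.tensor([1.5, 3.0])
print(q_learning_td_error(q_tm1, a_tm1, r_t, discount_t, q_t))
```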

Key differences:

  • we do not allow stop-gradient in the objective, because it is more efficient to use torch.no_grad at evaluation time, and detach() does not seem to play well with vmap.
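A minimal sketch of what this means for the caller, assuming a standard PyTorch online/target network setup (the network and variable names are illustrative): gradients through the bootstrap target are disabled with torch.no_grad() outside the objective, rather than with a stop-gradient/detach() inside it.

```python
import torch
import torch.nn as nn

policy_net = nn.Linear(4, 2)   # online Q-network (illustrative)
target_net = nn.Linear(4, 2)   # target Q-network (illustrative)

obs_tm1 = torch.randn(8, 4)
a_tm1 = torch.randint(0, 2, (8,))
r_t = torch.randn(8)
discount_t = torch.full((8,), 0.99)
obs_t = torch.randn(8, 4)

q_tm1 = policy_net(obs_tm1)
# The caller disables gradients for the target values with torch.no_grad(),
# instead of relying on a stop-gradient/detach() inside the objective.
with torch.no_grad():
    q_t = target_net(obs_t)

target = r_t + discount_t * q_t.max(dim=-1).values
td_error = target - q_tm1.gather(-1, a_tm1.unsqueeze(-1)).squeeze(-1)
loss = td_error.pow(2).mean()
loss.backward()
```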