yfeng95/PoseGPT

Clarification on obtaining the embedding related to the <POSE> token

AndrejHafner opened this issue · 1 comment

Hello! First of all, thank you for the great article. I have a question about how you obtain the embedding related to the <POSE> token, which is then projected and used for human pose reconstruction. If I understand correctly, when the model outputs a <POSE> token, you take the logits from the last layer of the LLM (the values softmax is applied to, with the token then sampled from the resulting distribution) and use those as the embedding?

I think it's the last-layer embedding (hidden_states, before the logits) corresponding to the <POSE> token. For reference, see LISA: https://github.com/dvlab-research/LISA.
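
To make the distinction concrete, here is a minimal toy sketch in plain Python. It is not the PoseGPT or LISA code: the sizes, the token id, the random weights, and the single-layer projection are all hypothetical, and it assumes that row i of hidden_states is the last-layer hidden state from which token i was predicted. It only illustrates that the logits live in vocabulary space, while the embedding picked for <POSE> is the hidden state before the LM head.

```python
import random

# All names and sizes are hypothetical, chosen only for illustration.
VOCAB_SIZE = 10     # toy vocabulary size
HIDDEN_DIM = 4      # width of the last-layer hidden state
POSE_DIM = 3        # width of the toy pose feature
POSE_TOKEN_ID = 7   # id assumed for the <POSE> token


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Stand-in for the LLM output: emitted token ids and, for each one, the
# last-layer hidden state it was predicted from (assumption of this sketch).
output_ids = [1, 4, 2, POSE_TOKEN_ID, 9]
hidden_states = random_matrix(len(output_ids), HIDDEN_DIM, rng)

# The LM head maps each hidden state to vocabulary logits; sampling uses these.
lm_head = random_matrix(VOCAB_SIZE, HIDDEN_DIM, rng)
logits = [matvec(lm_head, h) for h in hidden_states]

# The <POSE> embedding is the hidden state at the <POSE> position, not the logits.
pose_positions = [i for i, t in enumerate(output_ids) if t == POSE_TOKEN_ID]
pose_embeddings = [hidden_states[i] for i in pose_positions]

# A separate projection (one linear layer here) maps it to pose features.
pose_projection = random_matrix(POSE_DIM, HIDDEN_DIM, rng)
pose_features = [matvec(pose_projection, e) for e in pose_embeddings]

print(len(logits[pose_positions[0]]), len(pose_embeddings[0]), len(pose_features[0]))
```

The logits have VOCAB_SIZE entries and are only used to choose the next token; the vector passed on to the pose projection has HIDDEN_DIM entries. Which position to index in a real model depends on how that implementation shifts inputs and outputs, so check the actual code for that detail.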