Clarification on obtaining the embedding related to the <POSE> token
AndrejHafner opened this issue · 1 comment
AndrejHafner commented
Hello! First of all, thank you for the great article. I have a question about how you obtain the embedding for the <POSE> token, which is then projected and used for human pose reconstruction. If I understand correctly, when the model outputs a <POSE> token, you take the logits from the last layer of the LLM (the values that softmax is applied to, giving the distribution from which the token is sampled) and use those as the embedding?
JJJYmmm commented
I think it's the last-layer embedding (hidden_states, before the logits are computed) corresponding to the <POSE> token. For reference, see LISA: https://github.com/dvlab-research/LISA.
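
To make the distinction concrete, here is a minimal toy sketch in plain Python. It is not the authors' code. The vocabulary, the dimensions, the random weights, and the names `lm_head`, `pose_proj`, and `pose_embeddings` are all hypothetical stand-ins. It also assumes that `hidden_states[i]` is the last-layer vector aligned with `token_ids[i]`; the exact position alignment in a real model should be checked against the LISA code.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary and sizes (assumptions for illustration only).
VOCAB = ["<pad>", "the", "pose", "is", "<POSE>"]
POSE_ID = VOCAB.index("<POSE>")
HIDDEN_DIM = 4   # width of the LLM's last-layer hidden state
POSE_DIM = 3     # width of the feature handed to the pose decoder


def random_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def pose_embeddings(token_ids, hidden_states, pose_id):
    """Select the last-layer hidden states at positions holding the pose token."""
    return [h for t, h in zip(token_ids, hidden_states) if t == pose_id]


# Toy sequence ending in <POSE>, with one last-layer hidden state per position.
token_ids = [1, 2, 3, POSE_ID]
hidden_states = random_matrix(len(token_ids), HIDDEN_DIM)

lm_head = random_matrix(len(VOCAB), HIDDEN_DIM)   # hidden state -> vocabulary logits
pose_proj = random_matrix(POSE_DIM, HIDDEN_DIM)   # hidden state -> pose feature

pose_hidden = pose_embeddings(token_ids, hidden_states, POSE_ID)

# Logits have one entry per vocabulary item and are only used to pick the token.
logits_at_pose = matvec(lm_head, pose_hidden[0])

# The pose branch projects the hidden state itself, not the logits.
pose_feature = matvec(pose_proj, pose_hidden[0])

print(len(logits_at_pose), len(pose_hidden[0]), len(pose_feature))
```

The point of the sketch is that the logits live in vocabulary space, while the vector passed to the projection layer lives in the model's hidden space, one step before the LM head is applied.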