GuyTevet/MotionCLIP

Similarities computed using motion and text embeddings are incorrect

sohananisetty opened this issue · 2 comments

As with CLIP, where we compute image and text embeddings and use their similarities to retrieve the best-matching text, I tried the same with motion and text embeddings, but it does not work.

E.g., using the AMASS dataset with bs = 2 and the texts 'jump' and 'dancing':

import torch
import clip

# Motion embeddings from the MotionCLIP encoder, L2-normalized.
emb = enc.encode_motions(batch['x']).to(device)
emb /= emb.norm(dim=-1, keepdim=True)

# Text embeddings from the CLIP text encoder, L2-normalized.
text_inputs = torch.cat([clip.tokenize(c) for c in batch["clip_text"]]).to(device)
text_features = clip_model.encode_text(text_inputs).float()
text_features /= text_features.norm(dim=-1, keepdim=True)

# Scaled cosine similarities, softmaxed into per-motion probabilities over the texts.
logit_scale = clip_model.logit_scale.exp()
similarity = (logit_scale * emb @ text_features.T).softmax(dim=-1)

values, indices = similarity[0].topk(len(batch["clip_text"]))

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{batch['clip_text'][index]:>16s}: {100 * value.item():.2f}%")

Expected output for similarity[0]: a high probability for 'jump'.
Instead I get a high probability for 'dancing'. I have tested this with multiple batches, and the correct text does not get the highest similarity most of the time. Am I running inference incorrectly?
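
For reference, the retrieval step on its own reduces to the sketch below, with random tensors standing in for the actual motion and text embeddings (the shapes are only illustrative). Since the softmax is monotonic within each row, it cannot change which text ranks first, so a wrong top match would point at the embeddings rather than this step.

import torch

# Placeholders for the motion and text embeddings; shapes mirror the bs = 2
# example above (2 motions, 2 texts, 512-dim), values are random.
motion_emb = torch.randn(2, 512)
text_emb = torch.randn(2, 512)

# L2-normalize both sides so the dot product is a cosine similarity.
motion_emb = motion_emb / motion_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Raw cosine similarities, without the logit scale or softmax.
cosine = motion_emb @ text_emb.T      # (2 motions, 2 texts)
print(cosine)
print(cosine.argmax(dim=-1))          # index of the best-matching text per motion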

That's weird. Your code looks good to me, but we do know that cosine similarity should work to some extent, based on the action classification experiment. Did you try using that experiment as a reference?
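
Roughly, that evaluation boils down to a zero-shot classification loop like the sketch below (the class names, labels, and topk_accuracy helper here are illustrative placeholders, not the repository's actual evaluation script):

import torch
import clip

# Zero-shot action classification against a fixed set of class texts.
# `enc.encode_motions` is from the snippet earlier in this thread; the class
# names and labels are hypothetical.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

class_names = ["jump", "dance", "walk", "run", "sit"]   # hypothetical label set
with torch.no_grad():
    text_features = clip_model.encode_text(clip.tokenize(class_names).to(device)).float()
text_features /= text_features.norm(dim=-1, keepdim=True)

def topk_accuracy(motion_features, labels, k=5):
    # motion_features: (N, D) motion embeddings, e.g. enc.encode_motions(...) outputs.
    # labels: (N,) ground-truth indices into class_names.
    motion_features = motion_features / motion_features.norm(dim=-1, keepdim=True)
    sims = motion_features @ text_features.T             # (N, num_classes)
    topk = sims.topk(k, dim=-1).indices                  # (N, k)
    correct = (topk == labels.unsqueeze(-1)).any(dim=-1)
    return correct.float().mean().item()

Top-1 and Top-5 accuracy then come from calling topk_accuracy with k=1 and k=5 over the evaluation set.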

I ran the action classification script using the general model and got:

Top-5 Acc. : 29.86%  (637/2133)
Top-1 Acc. : 13.41%  (286/2133)

Using the finetuned model:

Top-5 Acc. : 63.72%  (1354/2125)
Top-1 Acc. : 44.99%  (956/2125)

I assumed the zero-shot nature of CLIP would provide at least some generalizability, but that does not seem to be the case.