Add SignCLIP
cleong110 commented
Add https://arxiv.org/abs/2407.01264 to the site.
Checklist
- sync, pull and merge master first!
- Search for the correct citation on Semantic Scholar
- Make a new branch ("You should always branch out from master")
- Add the citation to references.bib. If it is a dataset, prepend the key with `dataset:`. Exclude wordy abstracts. (The Better BibTeX extension for Zotero can exclude keys.)
- Check for egregious `{}` in the BibTeX.
- Write a summary and add it to the appropriate section in index.md.
- Make sure the citation keys match.
- Add a newline after each sentence in a paragraph. Still shows up as one paragraph but makes git stuff easier.
- ChatGPT 3.5 can suggest rewrites and improve writing.
- Check if acronyms are explained
- Copy-Paste into https://dillinger.io/, see if it looks OK
- Make a PR from the branch on my fork to master on the source repo
PR:
- sync master of both forks
- git pull master on local
- `git merge master` on branch
- git push
- THEN make the PR
Writing/style:
- try to describe what they did, not what the general process is.
- Don't have to describe what's in a repo
- something like "Three Letter Acronym (TLA)" is how you introduce acronyms
- Look through the Style Guide on the README
- "Evaluations on X and Y datasets" should have a "the": "Evaluations on the X and Y datasets."
cleong110 commented
Progress/notes:
Citation: apparently only on arXiv thus far, according to Semantic Scholar: https://www.semanticscholar.org/paper/SignCLIP%3A-Connecting-Text-and-Sign-Language-by-Jiang-Sant/75a7a3ab20a620f612db3337fcf6df03b304242d
branch: paper/jiangSignCLIPConnectingText2024
cleong110 commented
My initial summary of some key points
- VideoCLIP, but for sign languages. The code is even based on theirs.
- Specifically, pretrained on SpreadTheSign: 500 hours of signing data.
- Text embeddings come from a frozen BERT model, 768-dimensional.
- Experiments with various visual encoders, whose outputs get projected to an embedding of the same size (768).
- Loss function: "we employ the InfoNCE loss (Oord et al., 2018)", what even is that? (It is the contrastive loss from Contrastive Predictive Coding; see the sketch after this list.) https://www.semanticscholar.org/paper/Representation-Learning-with-Contrastive-Predictive-Oord-Li/b227f3e4c0dc96e5ac5426b85485a70f2175a205
- Evaluation on a retrieval task.
- Code is at https://github.com/J22Melody/fairseq/tree/main/examples/MMPT; I got it running on my laptop.
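To make the architecture and loss concrete, here is a minimal PyTorch sketch (not the authors' code, which lives in the MMPT fork linked above): a frozen BERT text tower and a hypothetical `VisualProjection` layer standing in for whichever visual encoder is used, both landing in a shared 768-d space, trained with a symmetric InfoNCE loss over in-batch negatives. The class names, mean pooling, temperature value, and in-batch-negative scheme are my assumptions for illustration.

```python
# Minimal sketch (NOT the SignCLIP code): CLIP/VideoCLIP-style contrastive training
# with a frozen BERT text encoder and a learned projection for visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

EMBED_DIM = 768  # both towers end in the same 768-d shared space

class TextTower(nn.Module):
    """Frozen BERT; mean-pool the last hidden states into one 768-d vector."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False  # text encoder stays frozen

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        return (out.last_hidden_state * mask).sum(1) / mask.sum(1)

class VisualProjection(nn.Module):
    """Hypothetical stand-in for a visual encoder's pooled clip features
    (I3D / S3D / pose keypoints), projected to the shared 768-d space."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, EMBED_DIM)

    def forward(self, video_feats):  # (batch, feat_dim)
        return self.proj(video_feats)

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives: each video's positive is its
    own text, every other text in the batch is a negative (and vice versa)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                     # (batch, batch) cosine sims
    targets = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, targets)           # video -> text
            + F.cross_entropy(logits.T, targets)) / 2  # text -> video
```

The retrieval evaluation then amounts to ranking that same cosine-similarity matrix, e.g. recall@k over its rows (video-to-text) or columns (text-to-video).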
Also interesting:
- every experiment takes at most one A100-day
- off-the-shelf VideoCLIP basically just gets "random guess" accuracy
Encoders include:
- VideoSwin
- S3D pretrained on HowTo100M
- I3D from BSL-1K
- MediaPipe Holistic, which "
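Since MediaPipe Holistic is the pose-based option among those encoders, here is a rough sketch (not the paper's pipeline) of what extracting per-frame Holistic landmarks as input features could look like. The choice to flatten x/y/z coordinates and to skip the face mesh is mine; in practice these per-frame vectors would still need pooling or a sequence model before the 768-d projection above.

```python
# Rough sketch (not the paper's pipeline): per-frame MediaPipe Holistic landmarks
# flattened into a feature vector that could feed the projection layer above.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def holistic_features(video_path):
    """Return an array of shape (num_frames, 225): pose + both hands, x/y/z each."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
            feats = []
            for lms, n in [(results.pose_landmarks, 33),
                           (results.left_hand_landmarks, 21),
                           (results.right_hand_landmarks, 21)]:
                if lms is None:                      # landmark set not detected
                    feats.append(np.zeros((n, 3)))
                else:
                    feats.append(np.array([[p.x, p.y, p.z] for p in lms.landmark]))
            frames.append(np.concatenate(feats).reshape(-1))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, (33 + 21 + 21) * 3))
```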