Add SignCLIP
cleong110 commented
Add https://arxiv.org/abs/2407.01264 to the site.
Checklist
- sync, pull and merge master first!
- Search for the correct citation on Semantic Scholar
- Make a new branch ("You should always branch out from master")
- Add the citation to references.bib. If it is a dataset, prepend the key with `dataset:`. Exclude wordy abstracts. (The Better BibTeX extension for Zotero can exclude keys.)
- Check for egregious `{}` in the BibTeX.
- Write a summary and add it to the appropriate section in index.md.
- Make sure the citation keys match.
- Add a newline after each sentence in a paragraph. Still shows up as one paragraph but makes git stuff easier.
- ChatGPT 3.5 can suggest rewrites and improve writing.
- Check if acronyms are explained
- Copy-Paste into https://dillinger.io/, see if it looks OK
- Make a PR from the branch on my fork to master on the source repo
PR:
- sync master of both forks
- git pull master on local
- `git merge master` on branch
- git push
- THEN make the PR
Writing/style:
- try to describe what they did, not what the general process is.
- Don't have to describe what's in a repo
- something like "Three Letter Acronym (TLA)" is how you introduce acronyms
- Look through the Style Guide on the README
- "Evaluations on X and Y datasets" should have a "the": "Evaluations on the X and Y datasets."
cleong110 commented
Progress/notes:
Citation: apparently only on arXiv thus far, according to Semantic Scholar: https://www.semanticscholar.org/paper/SignCLIP%3A-Connecting-Text-and-Sign-Language-by-Jiang-Sant/75a7a3ab20a620f612db3337fcf6df03b304242d
branch: paper/jiangSignCLIPConnectingText2024
cleong110 commented
My initial summary of some key points
- VideoCLIP, but for sign languages. The code is even based on theirs.
- Specifically, pretrained on SpreadTheSign: 500 hours of signing data.
- Text embeddings come from a frozen BERT model, 768-dimensional.
- Experiments with various visual encoders, whose outputs get projected to an embedding of the same size (768).
- Loss function: "we employ the InfoNCE loss (Oord et al., 2018)", what even is that? (It is the contrastive loss from Contrastive Predictive Coding; see the sketch after this list.) https://www.semanticscholar.org/paper/Representation-Learning-with-Contrastive-Predictive-Oord-Li/b227f3e4c0dc96e5ac5426b85485a70f2175a205
- Evaluation on a retrieval task.
- Code is at https://github.com/J22Melody/fairseq/tree/main/examples/MMPT; I got it running on my laptop.
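To make the architecture and loss concrete, here is a minimal PyTorch sketch (not the authors' code, which lives in the MMPT fork linked above): a frozen BERT text tower and a hypothetical `VisualProjection` layer standing in for whichever visual encoder is used, both landing in a shared 768-d space, trained with a symmetric InfoNCE loss over in-batch negatives. The class names, mean pooling, temperature value, and in-batch-negative scheme are my assumptions for illustration.

```python
# Minimal sketch (NOT the SignCLIP code): CLIP/VideoCLIP-style contrastive training
# with a frozen BERT text encoder and a learned projection for visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

EMBED_DIM = 768  # both towers end in the same 768-d shared space

class TextTower(nn.Module):
    """Frozen BERT; mean-pool the last hidden states into one 768-d vector."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False  # text encoder stays frozen

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        return (out.last_hidden_state * mask).sum(1) / mask.sum(1)

class VisualProjection(nn.Module):
    """Hypothetical stand-in for a visual encoder's pooled clip features
    (I3D / S3D / pose keypoints), projected to the shared 768-d space."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, EMBED_DIM)

    def forward(self, video_feats):  # (batch, feat_dim)
        return self.proj(video_feats)

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives: each video's positive is its
    own text, every other text in the batch is a negative (and vice versa)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                     # (batch, batch) cosine sims
    targets = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, targets)           # video -> text
            + F.cross_entropy(logits.T, targets)) / 2  # text -> video
```

The retrieval evaluation then amounts to ranking that same cosine-similarity matrix, e.g. recall@k over its rows (video-to-text) or columns (text-to-video).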
Also interesting:
- every experiment takes at most one A100-day
- off-the-shelf VideoCLIP basically just gets "random guess" accuracy
Encoders include:
- VideoSwin
- S3D pretrained on HowTo100M
- I3D from BSL-1K
- MediaPipe Holistic, which "
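Since MediaPipe Holistic is the pose-based option among those encoders, here is a rough sketch (not the paper's pipeline) of what extracting per-frame Holistic landmarks as input features could look like. The choice to flatten x/y/z coordinates and to skip the face mesh is mine; in practice these per-frame vectors would still need pooling or a sequence model before the 768-d projection above.

```python
# Rough sketch (not the paper's pipeline): per-frame MediaPipe Holistic landmarks
# flattened into a feature vector that could feed the projection layer above.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def holistic_features(video_path):
    """Return an array of shape (num_frames, 225): pose + both hands, x/y/z each."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
            feats = []
            for lms, n in [(results.pose_landmarks, 33),
                           (results.left_hand_landmarks, 21),
                           (results.right_hand_landmarks, 21)]:
                if lms is None:                      # landmark set not detected
                    feats.append(np.zeros((n, 3)))
                else:
                    feats.append(np.array([[p.x, p.y, p.z] for p in lms.landmark]))
            frames.append(np.concatenate(feats).reshape(-1))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, (33 + 21 + 21) * 3))
```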