⚠️ This is EXPERIMENTAL code / a repo for messing with CLIP + Sparse Autoencoders (SAE). For 'good, known-working' code (and more scripts + info), please see zer0int/CLIP-fine-tune!
🚨
- Contains the code used to fine-tune my model, HF: zer0int/CLIP-SAE-ViT-L-14 🤗
- See the "attack" folder to obtain datasets required / used in 'a1-finetune.py'
- Gradients will be very large throughout training. Comment out 'monitor_gradient_norms' as needed
- Use a2 to convert the GmP model back to .weight after fine-tuning -> a normal CLIP model (usable in any 'import clip' downstream task)
- Use a4 to quickly zero-shot test the 3 typographic attack test images provided (a minimal sketch of such a check follows after this list)
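
As a rough illustration of the kind of zero-shot check script a4 performs, here is a minimal sketch using the standard `import clip` package. The image filename and the label prompts below are placeholders, not necessarily what a4 actually uses:

```python
# Minimal zero-shot check (same idea as a4; filename and labels are placeholders).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # or load your fine-tuned checkpoint

# e.g. one of the provided typographic attack test images (placeholder filename)
image = preprocess(Image.open("attack-test-image.png")).unsqueeze(0).to(device)
labels = ["an apple", "an iPod", "a piece of paper"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```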
📂
- The attack dataset was curated via SAE
- Selected for typographic attack salience (i.e. CLIP's 'text obsession' -> misclassifies the image because the text in it is highly salient to the model)
- Fine-tune: Geometric Parametrization (GmP) + scaling of neurons whose top-stimulating images are 'text salient' (identified via the SAE); a hypothetical sketch of such scaling follows after this list
- For details about GmP, see my other repo: zer0int/CLIP-fine-tune
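
Purely to illustrate the idea (the actual selection in this repo came from the SAE's top-stimulating images, and the scaling is wired into a1-finetune.py), here is a hedged sketch of what scaling a handful of MLP neurons in the stock OpenAI CLIP ViT could look like. The layer index, neuron indices, and scale factor below are hypothetical placeholders:

```python
# Hypothetical sketch: damp a few MLP neurons in one vision resblock.
# Layer, indices and scale are placeholders; the real selection came from the SAE.
import torch
import clip

model, _ = clip.load("ViT-L/14", device="cpu")

layer_idx = 10                    # placeholder: which vision resblock to edit
neuron_idx = [42, 1337, 2048]     # placeholder: SAE-identified 'text salient' neurons
scale = 0.5                       # placeholder: down-scaling factor

block = model.visual.transformer.resblocks[layer_idx]
with torch.no_grad():
    # Each MLP hidden neuron is one row of c_fc (and one column of c_proj).
    block.mlp.c_fc.weight[neuron_idx, :] *= scale
    block.mlp.c_fc.bias[neuron_idx] *= scale
    # Optionally also scale the same neurons on the output side:
    # block.mlp.c_proj.weight[:, neuron_idx] *= scale
```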
🔬
- Info: Toy Models of Superposition | Perturbing a single feature
- Reasoning: Brute-force snap those geometric bonds, hoping to force the CLIP model to find a better (less text-obsessed) solution
- ...Until I learn / find out what I am actually doing here (with regard to Sparse Autoencoders), at least. =)
- Sparse Autoencoder inspiration:
- Anthropic.AI research "Golden Gate Claude" + SAE details
- OpenAI: Top-K activation function (replaces ReLU in Sparse Autoencoders), arxiv; a tiny sketch of the Top-K idea follows below
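
To make the Top-K idea concrete: instead of ReLU, keep only the k largest pre-activations per sample and zero out everything else. A tiny sketch (k and shapes are arbitrary); see also the full SAE sketch further below:

```python
# Top-K activation: keep the k largest pre-activations per row, zero the rest.
import torch

def topk_activation(x: torch.Tensor, k: int) -> torch.Tensor:
    values, indices = torch.topk(x, k, dim=-1)
    return torch.zeros_like(x).scatter_(-1, indices, values)

x = torch.randn(4, 16)
print(topk_activation(x, k=3))  # only the 3 largest entries per row survive
```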
💡
- My SAE: Encoder-Decoder, tied weights + Top-K (puzzled together from the above!); a rough sketch follows after this list
- Is this a good autoencoder for CLIP? I don't know. 🤔
- Small hidden dimension + low Top-K => very sparse -> will learn concepts from CLIP that [with SAE-reconstructed embeds] retrieve images of very narrow concepts, e.g. ONLY stop signs.
- Huge hidden dimension (e.g. 8192) -> not so sparse, accuracy drops, more (seemingly) random encoded concepts (judging via image retrieval)
- Intermediate -> Learns complex, surprising, but meaningful concepts that are 'totally an AI-thing to encode'
- Anyway: the SAE is empirically shown to 'work', but is it good? What is BEST? 🤔
- Should I be using projection? Going 'back up' in the model with pinv? Hooking into the residual stream? I don't (yet) know! 🤷
- I will publish the code for the SAE once I am more confident that I know what I am actually doing (and have cleaned up the messy code).
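
Since the SAE code isn't published yet, the following is only a rough, hedged sketch of the architecture described above (encoder/decoder with tied weights + Top-K). The dimensions, the K value, and the MSE training snippet are placeholder choices, not necessarily what this repo uses:

```python
# Rough sketch of a tied-weights Top-K sparse autoencoder for CLIP embeddings.
# d_model, d_hidden and k are placeholders, not necessarily this repo's settings.
import torch
import torch.nn as nn

class TiedTopKSAE(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 4096, k: int = 32):
        super().__init__()
        self.k = k
        # Tied weights: the decoder is the transpose of the encoder matrix.
        self.weight = nn.Parameter(torch.randn(d_hidden, d_model) * 0.02)
        self.enc_bias = nn.Parameter(torch.zeros(d_hidden))
        self.dec_bias = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = (x - self.dec_bias) @ self.weight.T + self.enc_bias
        # Top-K instead of ReLU: keep the k largest pre-activations, zero the rest.
        values, indices = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, indices, values)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        codes = self.encode(x)
        return codes @ self.weight + self.dec_bias  # decode via the tied (transposed) weights

# Usage: reconstruct a batch of CLIP image embeddings and train on MSE.
sae = TiedTopKSAE(d_model=768, d_hidden=4096, k=32)
embeds = torch.randn(8, 768)  # stand-in for CLIP ViT-L/14 image embeddings (768-dim)
loss = nn.functional.mse_loss(sae(embeds), embeds)
loss.backward()
```

Lower d_hidden and k push toward the 'very sparse, very narrow concepts' regime described above; larger values move toward the less sparse, noisier end.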
🤪 For now, here's a fun concept of "things on the back of other things" in CLIP ViT-L/14 that the SAE learned:
An example of the effect of the images the SAE had chosen as salient typographic attacks for CLIP.
And the zero-shot results via script a4: