CLIP-SAE-finetune

Sparse Autoencoders (SAE) vs CLIP fine-tuning fun.

CLIP finetune: SAE-informed adversarial training 💥🤖💫

  • ⚠️ This is EXPERIMENTAL code / a repo for messing with CLIP + Sparse Autoencoders (SAE)
  • For 'good, known-working' code (and more scripts + info), please see zer0int/CLIP-fine-tune!

🔨

  • Contains the code used to fine-tune my model on HF: zer0int/CLIP-SAE-ViT-L-14 🤗
  • See the "attack" folder to obtain the datasets required / used in 'a1-finetune.py'
  • Gradients will be very large throughout training; comment out 'monitor_gradient_norms' as needed
  • Use a2 to convert the GmP model back to .weight after fine-tuning -> a normal CLIP model (usable in any 'import clip' downstream task)
  • Use a4 to quickly zero-shot test the 3 typographic attack test images provided (see the sketch after this list)
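
Below is a minimal zero-shot sketch in the spirit of a4, not the actual script: it scores a typographic attack test image against a few candidate labels with the stock OpenAI 'import clip' API. The file name and labels are placeholders, not the ones shipped in this repo.

```python
# Minimal zero-shot sketch (not the actual a4 script).
# The image path and labels below are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

image = preprocess(Image.open("attack_test_image.png")).unsqueeze(0).to(device)
labels = ["an apple", "a dog", "a piece of paper with text on it"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# A 'text-obsessed' CLIP will often rank the written word over the depicted object.
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```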

🔎

  • The attack dataset was curated via SAE
  • Selected for typographic attack salience (i.e. CLIP's 'text obsession' -> it misclassifies the image because text is highly salient to the model)
  • Fine-tune: Geometric Parametrization (GmP) + scaling of 'text-salient' neurons' top-stimulating images (via SAE)
  • For details about GmP, see my other repo: zer0int/CLIP-fine-tune (a rough sketch of the idea follows below)
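
The exact GmP code lives in zer0int/CLIP-fine-tune and may differ from this; as a rough sketch, assume GmP splits each weight row into a magnitude ('r') and a direction ('theta'), weight-normalization style, and that the a2 step folds them back into a plain .weight. Class and function names here are illustrative only.

```python
# Conceptual GmP sketch -- NOT the repo's implementation; see zer0int/CLIP-fine-tune.
# Assumption: each weight row is split into a magnitude 'r' and a direction 'theta'.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Linear layer parametrized per output row by magnitude (r) and direction (theta)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        self.r = nn.Parameter(w.norm(dim=1))               # per-row magnitude
        self.theta = nn.Parameter(F.normalize(w, dim=1))   # per-row direction
        self.bias = linear.bias

    def weight(self) -> torch.Tensor:
        # Recombine magnitude and (re-normalized) direction into a standard weight matrix.
        return self.r.unsqueeze(1) * F.normalize(self.theta, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight(), self.bias)

def to_plain_linear(gmp: GeometricLinear) -> nn.Linear:
    """Fold r/theta back into a normal nn.Linear -- the role of the a2 conversion step."""
    out_f, in_f = gmp.theta.shape
    plain = nn.Linear(in_f, out_f, bias=gmp.bias is not None)
    plain.weight.data.copy_(gmp.weight().detach())
    if gmp.bias is not None:
        plain.bias.data.copy_(gmp.bias.detach())
    return plain
```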

🔬

  • Info: Toy Models of Superposition | Perturbing a single feature
  • Reasoning: Brute-force snap those geometric bonds, hoping to force the CLIP model to find a better (less text-obsessed) solution 😅
  • ...Until I learn / find out what I am actually doing here (with regard to Sparse Autoencoders), at least. =)
  • Sparse Autoencoder inspiration:
  • Anthropic.AI research "Golden Gate Claude" + SAE details
  • OpenAI: Top-K activation function (replaces ReLU in Sparse Autoencoders), arXiv

๐Ÿ’กโ“

  • My SAE: Encoder-Decoder, tied weights + Top-K (puzzled together from the above!); a hedged sketch follows after this list
  • Is this a good autoencoder for CLIP? I don't know. 🤔
  • Small hidden dimension + low Top-K => very sparse -> will learn concepts from CLIP that [with SAE-reconstructed embeds] retrieve images of very narrow concepts, e.g. ONLY stop signs.
  • Huge hidden dimension (e.g. 8192) -> not so sparse, accuracy drops, more (seemingly) random encoded concepts (judging via image retrieval)
  • Intermediate -> Learns complex, surprising, but meaningful concepts that are 'totally an AI-thing to encode'
  • Alas: the SAE is empirically shown to be 'working', but is it good? What is BEST? 🤔
  • Should I be using projection? Going 'back up' in the model with pinv? Hooking into the residual stream? I don't (yet) know! 🤷
  • I will publish the code for the SAE once I am more confident that I know what I am actually doing (and have cleaned up the messy code 😂).
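
For illustration, here is a minimal sketch of such a tied-weight, Top-K sparse autoencoder trained on CLIP image embeddings. This is NOT the (unpublished) SAE used here; the dimensions, names, and training objective are assumptions for the sketch only.

```python
# Tied-weight Top-K SAE sketch -- illustrative only, not the repo's unpublished SAE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSAE(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 4096, k: int = 32):
        super().__init__()
        self.k = k
        # Tied weights: the decoder reuses the transpose of the encoder matrix.
        self.W = nn.Parameter(torch.randn(d_hidden, d_model) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-activations, then keep only the k largest per sample (Top-K instead of plain ReLU).
        pre = F.linear(x - self.b_dec, self.W, self.b_enc)
        topk = torch.topk(pre, self.k, dim=-1)
        codes = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return F.relu(codes)

    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        return F.linear(codes, self.W.t(), self.b_dec)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

# Usage sketch: reconstruct stand-in CLIP ViT-L/14 image embeddings (d=768).
sae = TopKSAE(d_model=768, d_hidden=4096, k=32)
embeds = torch.randn(8, 768)            # placeholder for real CLIP image embeddings
loss = F.mse_loss(sae(embeds), embeds)  # simple reconstruction objective
```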

🤪 For now, here's a fun concept of "things on the back of other things" in CLIP ViT-L/14 that the SAE learned:

[Image: SAE feature visualization, "things on the back of other things"]

Example of the effect of the images the SAE selected as salient typographic attacks against CLIP:

[Image: effect of SAE-selected typographic attack images]

And zero-shot results via script a4:

[Image: zero-shot results]