Experiments done during SERI MATS (Summer 2023)

Relation to research writeups

Activation steering with a "refusal vector" to make the llama-2-chat model stop refusing to answer harmful questions.

Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.

Activation addition experiments (pure act-adds from single forward passes)

Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer)

Early-stage experiments to measure whether LLMs are aware of their internal uncertainty over a prediction
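
To make the steering and logit-lens items above concrete, here is a minimal toy sketch in pure Python. It is not the code used in these experiments. It assumes the steering vector is the difference of mean activations between two contrastive prompt sets (one common construction; a pure act-add would instead take the difference from a single pair of forward passes), and it leaves out the final layer norm that is usually applied before the unembedding when decoding intermediate activations. All names, the 2-dimensional activations, and the 3-token vocabulary are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrastive sets.

    Assumption: mean-difference construction, chosen for illustration only.
    """
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def add_steering(hidden, vector, coeff):
    """Add coeff * vector to one hidden state (activation steering)."""
    return [h + coeff * v for h, v in zip(hidden, vector)]


def logit_lens(hidden, unembed):
    """Project a hidden state through an unembedding matrix.

    unembed has shape [vocab_size][d_model]; returns one logit per token.
    Simplification: no layer norm is applied before the projection.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]


def top_token(logits, vocab):
    """Return the vocabulary entry with the highest logit."""
    return vocab[logits.index(max(logits))]


# Hypothetical toy data: d_model = 2, vocabulary of 3 tokens.
vocab = ["yes", "no", "maybe"]
unembed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pos_acts = [[2.0, 0.0], [1.0, 0.0]]  # activations on "positive" prompts
neg_acts = [[0.0, 1.0], [0.0, 2.0]]  # activations on "negative" prompts

vec = steering_vector(pos_acts, neg_acts)
hidden = [0.0, 1.0]  # an intermediate activation to steer and decode

top_before = top_token(logit_lens(hidden, unembed), vocab)
steered = add_steering(hidden, vec, coeff=1.0)
top_after = top_token(logit_lens(steered, unembed), vocab)

print(top_before, top_after)
```

In this sketch a negative `coeff` pushes the activation in the opposite direction. With a real model, the same addition would typically be applied to the residual stream at a chosen layer during the forward pass.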