Need help understanding the code.
Thank you for sharing your code.
I am trying to understand a few things here and have three questions.
As mentioned in the repo:
For your convenience, we also provide some precomputed latents on [huggingface.](https://huggingface.co/datasets/wendlerc/llm-latent-language) Here are some [preliminary steering experiments](https://colab.research.google.com/drive/1EhCk3_CZ_nSfxxpaDrjTvM-0oHfN9m2n?usp=sharing) using the precomputed latents.
- What are those precomputed latents, and what is their purpose?
- What does the steering experiment provided in the Colab do? I could not find the details in the paper.
- Is the provided Colab link ( https://colab.research.google.com/drive/1l6qN-hmCV4TbTcRZB5o6rUk_QPHBZb7K?usp=sharing ) the main code for visualizing the English pivot language?
Thank you
Huggingface repo: wendlerc/llm-latent-language
Hi,
Let me try to answer your questions:
- The precomputed latents are the latent representations of the last token after each transformer block. The purpose is to save computational time, since Llama-2 forward passes are slow.
- The steering experiment goes beyond the paper but is cute, which is why we included it in one of the Colab notebooks.
- No, the main code required to reproduce the results from the paper is contained in the ipynb files in this repository. The Colab link is just a convenient way to explore some prompts by hand and inspect their logit lens decodings.
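For readers unfamiliar with logit lens decoding, here is a minimal sketch of the idea on toy data. It assumes the latents are stored as one vector per layer for the last token, and that decoding means applying the model's final RMSNorm followed by the unembedding matrix. The function names, array shapes, and toy values are illustrative assumptions, not the layout of the precomputed latents or the code in this repository.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm over the last axis (the normalization Llama-style models use)."""
    x = np.asarray(x, dtype=float)
    scale = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / scale * weight


def logit_lens(latents, norm_weight, unembed):
    """Decode each layer's latent into its most likely token id.

    latents:     (n_layers, d_model), last-token latent after each block (assumed layout)
    norm_weight: (d_model,), weight of the final norm
    unembed:     (vocab_size, d_model), unembedding matrix
    """
    logits = rms_norm(latents, norm_weight) @ np.asarray(unembed, dtype=float).T
    return logits.argmax(axis=-1)


# Toy example: 3 "layers", d_model = 4, vocabulary of 4 tokens.
toy_latents = np.array([
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 5.0],
])
toy_norm_weight = np.ones(4)
toy_unembed = np.eye(4)  # token i reads out coordinate i

token_ids = logit_lens(toy_latents, toy_norm_weight, toy_unembed)
print(token_ids)
```

With real data, the token ids per layer would be mapped back to strings with the tokenizer, which shows at which depth the decoded token switches language.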
Best,
Chris
Thank you so much for clearing this up. :)
Regarding the steering experiment, could you please point me to some resources that I could refer to?
I would appreciate it a lot.
We are basically following a strategy similar to https://arxiv.org/abs/2312.06681
In our case, we can very easily create paired prompts that differ only in language, so we don't even need the (A)/(B) prompt structure that they used.
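As a rough illustration of this kind of strategy, the sketch below computes a steering vector as the mean difference between latents of paired prompts and adds it to a latent. It runs on synthetic data: the array names, shapes, and the way the toy latents are generated are assumptions for illustration, not the code from the Colab notebook.

```python
import numpy as np


def steering_vector(latents_src, latents_tgt):
    """Mean difference between paired latents; returns shape (d_model,)."""
    latents_src = np.asarray(latents_src, dtype=float)
    latents_tgt = np.asarray(latents_tgt, dtype=float)
    if latents_src.shape != latents_tgt.shape:
        raise ValueError("paired latents must have the same shape")
    return (latents_tgt - latents_src).mean(axis=0)


def steer(latent, vector, alpha=1.0):
    """Shift a latent along the steering vector with strength alpha."""
    return np.asarray(latent, dtype=float) + alpha * vector


# Synthetic paired latents: a shared part plus a language-specific offset.
rng = np.random.default_rng(0)
n_pairs, d_model = 200, 16
shared = rng.normal(size=(n_pairs, d_model))   # same content in both prompts
dir_src = rng.normal(size=d_model)             # hypothetical source-language offset
dir_tgt = rng.normal(size=d_model)             # hypothetical target-language offset
noise = 0.01 * rng.normal(size=(2, n_pairs, d_model))

latents_src = shared + dir_src + noise[0]
latents_tgt = shared + dir_tgt + noise[1]

v = steering_vector(latents_src, latents_tgt)
steered = steer(latents_src[0], v)
```

In a real experiment the two arrays would hold latents from one layer for prompts that differ only in language, and `alpha` would be tuned by inspecting the model's output after the intervention.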
Here are some of my thoughts on the topic of steering:
Thoughts on steering
Superposition theory (toy models, monosemanticity) suggests that neural networks represent features as vectors (e.g., of neuron activations; or of the stuff that is in the residual stream).
Let's suppose this holds. A latent at layer
Given this linear representation, a representation leveraging an abstract concept space for dealing with, e.g., multilingual data, could look like this:
Now, if we had a method to compute
intervention:
Simplified setting
Let's consider, e.g.,
with
We can estimate
Better (what ppl. usually do)
We can drop the assumption
since
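One possible way to write the steering idea down is sketched here in assumed notation, which is not necessarily the notation of the comment above. The decomposition into a concept part $c$, a language part $\ell$, and a remainder $r$ is an illustrative assumption, as are the paired prompts $x_i^{A}, x_i^{B}$ that differ only in language.

```latex
\begin{align}
  % Assumed linear decomposition of a latent at a fixed layer
  z(x) &= c(x) + \ell_{\mathrm{lang}(x)} + r(x) \\
  % Paired prompts share the concept: c(x_i^A) = c(x_i^B).
  % Simplified setting: additionally assume r(x_i^A) = r(x_i^B)
  z(x_i^{B}) - z(x_i^{A}) &= \ell_{B} - \ell_{A} \\
  % Estimate of the steering vector from n pairs
  \hat{v} &= \frac{1}{n} \sum_{i=1}^{n} \bigl( z(x_i^{B}) - z(x_i^{A}) \bigr) \\
  % Intervention with strength \alpha
  z' &= z + \alpha \, \hat{v} \\
  % Without the assumption on r, the estimate picks up an averaged remainder
  \hat{v} &= (\ell_{B} - \ell_{A})
            + \frac{1}{n} \sum_{i=1}^{n} \bigl( r(x_i^{B}) - r(x_i^{A}) \bigr)
\end{align}
```

Under this reading, the averaged remainder term shrinks as the number of pairs grows, provided the remainder differences have mean zero across pairs, which is why averaging over many paired prompts is the usual practice.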
Thank you so much for these resources. 🙌
Happy to help!