epfl-dlab/llm-latent-language

Need help understanding the code.


Thank you for sharing your code.

I am trying to understand a few things here and have three questions.
As mentioned in the repo:

For your convenience, we also provide some precomputed latents on [huggingface.](https://huggingface.co/datasets/wendlerc/llm-latent-language) Here are some [preliminary steering experiments](https://colab.research.google.com/drive/1EhCk3_CZ_nSfxxpaDrjTvM-0oHfN9m2n?usp=sharing) using the precomputed latents.

  1. What are those precomputed latents, and what is their purpose?

  2. What is the steering experiment in the Colab doing? I could not find the details in the paper.

  3. Is the Colab link provided ( https://colab.research.google.com/drive/1l6qN-hmCV4TbTcRZB5o6rUk_QPHBZb7K?usp=sharing ) the main code to visualize the pivot through English?

Thank you

Huggingface repo: wendlerc/llm-latent-language

Hi,

Let me try to answer your questions:

  1. The precomputed latents are the latent representations of the last token after each transformer block. The purpose is to save computational time, since Llama-2 forward passes are slow.

  2. The steering experiment goes beyond the paper but is cute, which is why we included it in one of the Colab notebooks.

  3. No, the main code required to reproduce the results from the paper is contained in the ipynb files in this repository. The colab link provided is just nice to explore some prompts by hand and inspect their logit lens decodings.

Best,
Chris

Thank you so much for clearing this. :)

Regarding steering experiment, can you please provide some resources that I could refer to?

I would appreciate it a lot.

We are basically following a strategy similar to https://arxiv.org/abs/2312.06681

In our case, we can very easily create paired prompts that differ only in language, so we don't even need the (A)/(B) prompt structure that they used.

Here are some of my thoughts on the topic of steering:

Thoughts on steering

Superposition theory (toy models, monosemanticity) suggests that neural networks represent features as vectors (e.g., of neuron activations; or of the stuff that is in the residual stream).

Let's suppose this holds. A latent at layer $i$ takes the form
$$z = \sum_{j} \alpha_j f_j,$$ with $\alpha_j \in \mathbb{R}$ and $f_j \in \mathbb{R}^d$.

Given this linear representation, a representation leveraging an abstract concept space for dealing with, e.g., multilingual data, could look like this:
$$z = z_{\text{concept}} + z_{\text{decoding language}} + z_{\text{rest}}.$$

Now, if we had a method to compute $z_{\text{decoding language}}$ or $\triangle = z_{\text{target language}} - z_{\text{source language}}$, we could change the output language by the following
intervention:
$$z' = z - z_{\text{source language}} + z_{\text{target language}} = z + \triangle.$$
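As a minimal sketch of this intervention on synthetic vectors (not the code from the Colab notebook): the helper name `steer`, the `scale` parameter, and the toy dimension `d = 8` are assumptions made for illustration only.

```python
import numpy as np

def steer(z, delta, scale=1.0):
    """Shift a latent z along the language direction delta (z' = z + scale * delta)."""
    return z + scale * delta

rng = np.random.default_rng(0)
d = 8  # toy hidden size (assumption; real residual streams are much wider)

# Synthetic stand-ins for the feature vectors in the decomposition above.
z_concept, z_source, z_target, z_rest = rng.normal(size=(4, d))

# Latent of a prompt that decodes into the source language.
z = z_concept + z_source + z_rest

# Language direction and the steered latent.
delta = z_target - z_source
z_steered = steer(z, delta)

# Under the linear model, the steered latent keeps concept and rest
# but now carries the target-language feature.
expected = z_concept + z_target + z_rest
print(np.allclose(z_steered, expected))
```

With `scale=1.0` this is exactly $z' = z + \triangle$; other values of `scale` are a knob added here for illustration, not something stated in the thread.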

Simplified setting

Let's consider, e.g., $\ell_1 = \text{RU}$ as source language and $\ell_2 = \text{ZH}$ as target language and the following simplified model for a latent of a prompt in language $\ell$: $$z = z_{\ell} + z_{\text{rest}},$$
with $z_{\text{rest}} \sim N(0, \sigma)$.

We can estimate $z_{\ell}$ using a dataset of latents $D_{\ell}$, with $|D_{\ell}| = n$ that all share the feature $z_{\ell} \in \mathbb{R}^d$:

$$z_{\ell} \approx \frac{1}{n}\sum_{z \in D_{\ell}} z = z_{\ell} + \underbrace{\frac{1}{n} \sum_{k} z_{r_k}}_{\approx 0}.$$
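A small simulation of this estimate, assuming synthetic data: the dimension `d`, the noise scale `sigma`, and the helper `estimate_feature` are hypothetical choices for illustration. It only shows that the averaged noise term shrinks as $n$ grows when the rest is zero-mean.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 64, 1.0           # toy dimension and noise scale (assumptions)
z_lang = rng.normal(size=d)  # the shared language feature z_l

def estimate_feature(n):
    """Average n synthetic latents z = z_l + z_rest with zero-mean Gaussian z_rest."""
    rest = rng.normal(0.0, sigma, size=(n, d))
    latents = z_lang + rest
    return latents.mean(axis=0)

# Estimation error ||mean(D_l) - z_l|| for growing dataset sizes.
errors = {n: np.linalg.norm(estimate_feature(n) - z_lang) for n in (10, 100, 10000)}
for n, err in errors.items():
    print(f"n={n:>6}  error={err:.3f}")
```

For independent Gaussian noise the error should decay roughly like $\sigma\sqrt{d/n}$, so the estimate is only as good as the zero-mean assumption and the dataset size allow.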

Better (what people usually do)

We can drop the assumption $z_{\text{rest}} \sim N(0, \sigma)$ by observing
$$\mu = \frac{1}{n} \sum_{z \in D_{\ell}} z = z_{\ell} + \mu_r,$$
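since $z_{\ell}$ is shared among all examples. If the two datasets consist of paired prompts that differ only in language, $\mu_r$ is (approximately) the same for both, and we can compute $\triangle$ as the difference of the two means: $$\triangle = \mu_2 - \mu_1 = \mu_{r} + z_{\ell_2} - \mu_{r} - z_{\ell_1} = z_{\ell_2} - z_{\ell_1}.$$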
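The difference-of-means computation can be sketched on synthetic data as follows. This assumes paired datasets in which example $k$ has the same rest component in both languages, so that $\mu_r$ cancels; all names and sizes (`D1`, `D2`, `d`, `n`, the value of `mu_r`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200                         # toy sizes (assumptions)
z_l1, z_l2 = rng.normal(size=(2, d))   # source and target language features

# Rest component with a clearly nonzero mean, shared across the paired prompts.
mu_r = np.full(d, 3.0)
rest = mu_r + rng.normal(size=(n, d))

# Paired datasets: same rest per example, different language feature.
D1 = z_l1 + rest
D2 = z_l2 + rest

mu_1, mu_2 = D1.mean(axis=0), D2.mean(axis=0)
delta = mu_2 - mu_1

# A single mean is a biased estimate of z_l1 (it still contains the mean of the rest) ...
bias = np.linalg.norm(mu_1 - z_l1)
# ... but that shared term cancels in the difference of means.
print(np.allclose(delta, z_l2 - z_l1), round(float(bias), 2))
```

If the prompts are not paired, the two rest means only agree approximately, and `delta` picks up their difference as an extra error term.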

Thank you so much for these resources. 🙌

Happy to help!