Vectors in the wild
How do you represent a Transformer base language model as a vector? We want to derive vector representations of fine-tuned models and see if we can find a way to cluster them. We leverage the following intuition: fine-tuning a language model amounts to turning each of its parameter matrices into a slightly different matrix, so the parameters themselves should carry a signature of the task the model was fine-tuned on.
The vast majority of language models rely on the Transformer architecture. It is an encoder-decoder neural network that was initially built for machine translation before being extended to many other use cases. Concretely, in a Transformer encoder we have:
- An embedding layer: it takes as input a sequence of tokens $x = [x_1, \ldots, x_n]$ and turns it into a matrix $X \in \mathbb{R}^{n\times d_{model}}$.
- A self-attention module: it usually works with multi-head attention and is characterized by 4 matrices:
  - $W_q \in \mathbb{R}^{d_{model}\times d_{model}}$: the matrix of queries. In practice, we can view it as a tensor in $\mathbb{R}^{h\times d_{model} \times \frac{d_{model}}{h}}$ to make the multi-head structure explicit, where $h$ is the number of heads and $d_k = \frac{d_{model}}{h}$.
  - $W_k$: the matrix of keys.
  - $W_v$: the matrix of values.
  - $W_o$: the output matrix of multi-head attention. Each head computes $softmax \left ( \frac{XW_q (XW_k)^T}{\sqrt{d_k}} \right )XW_v \in \mathbb{R}^{n\times d_k}$; we concatenate the head outputs along the feature dimension (to obtain a matrix in $\mathbb{R}^{n \times d_{model}}$) and multiply the result by $W_o$ (see the sketch after this list).
- A feedforward network: it is characterized by 2 matrices:
  - $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$, where in practice either $d_{ff} = 4d_{model}$ or $d_{ff} = \frac{8}{3}d_{model}$.
  - $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$.
- We do not consider the biases (they are not always used in practice either).
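To make the shapes above concrete, here is a minimal NumPy sketch of the attention and feedforward computations of one encoder layer, with random weights and toy dimensions (residual connections and layer normalization are left out, just like the biases):

```python
import numpy as np

# Toy dimensions -- purely illustrative, not tied to a specific checkpoint.
n, d_model, h = 8, 64, 4
d_k = d_model // h

rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))                    # token embeddings
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
W_1 = rng.normal(size=(d_model, 4 * d_model))        # feedforward, d_ff = 4 * d_model
W_2 = rng.normal(size=(4 * d_model, d_model))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# View each d_model x d_model projection as h blocks of width d_k and run
# attention head by head.
heads = []
for i in range(h):
    cols = slice(i * d_k, (i + 1) * d_k)
    Q, K, V = X @ W_q[:, cols], X @ W_k[:, cols], X @ W_v[:, cols]
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)    # (n, d_k)

attn_out = np.concatenate(heads, axis=-1) @ W_o          # (n, d_model)
ffn_out = np.maximum(attn_out @ W_1, 0.0) @ W_2          # ReLU feedforward, (n, d_model)
print(attn_out.shape, ffn_out.shape)
```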
An encoder-based model has $L$ such layers stacked on top of each other, so it is fully described by the embedding matrix plus six matrices per layer. A vector representation of the model can then be obtained from these matrices, for instance by computing one statistic per matrix and concatenating the results over layers.
When it comes to encoder-decoder models, it is a bit different. The final representation is the concatenation of the encoder representation and the decoder representation. The algorithm is the same as above. The only difference is that the decoder layers need to incorporate the cross-attention parameters in the above computation.
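As a rough illustration, the snippet below builds such a vector for an encoder-based checkpoint by computing one scalar per parameter matrix and concatenating the results over layers. The choice of statistic (here the Frobenius norm) is only a placeholder, and the attribute names follow the BERT implementation in `transformers`; both are assumptions rather than prescriptions.

```python
import torch
from transformers import AutoModel

MATRIX_TYPES = ("query", "key", "value", "output", "ffn1", "ffn2")

def layer_matrices(layer):
    """The six weight matrices of one BERT-style encoder layer.
    (PyTorch stores Linear weights transposed; this does not affect
    the statistic computed below.)"""
    return {
        "query": layer.attention.self.query.weight,
        "key": layer.attention.self.key.weight,
        "value": layer.attention.self.value.weight,
        "output": layer.attention.output.dense.weight,
        "ffn1": layer.intermediate.dense.weight,
        "ffn2": layer.output.dense.weight,
    }

def model_representation(model_id: str) -> torch.Tensor:
    """One scalar per matrix, concatenated over layers and matrix types."""
    model = AutoModel.from_pretrained(model_id)
    stats = []
    with torch.no_grad():
        for layer in model.encoder.layer:                # encoder-based models
            mats = layer_matrices(layer)
            stats.extend(mats[name].norm().item() for name in MATRIX_TYPES)
    return torch.tensor(stats)                           # shape: (L * 6,)

vec = model_representation("bert-base-uncased")
print(vec.shape)   # 12 layers x 6 matrices = 72 entries
```

For an encoder-decoder checkpoint, the same loop would also run over the decoder layers (including their cross-attention matrices) and the two vectors would be concatenated.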
Research questions:
- Can we cluster our representations per task / fine-tuning dataset?
To answer this question, we can study base language models such as bert-base-uncased. For each of these models, we can find a collection of their fine-tuned versions by leveraging the HfApi. We can then compute a matrix of size (number of fine-tuned models, dimension of the representation) and try to cluster its rows.
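Here is one way this collection step could look with `huggingface_hub`. It assumes that fine-tuned checkpoints expose a `base_model:finetune:...` tag derived from their model-card metadata; checkpoints that do not declare their base model will be missed.

```python
from huggingface_hub import HfApi

api = HfApi()

# List checkpoints that declare bert-base-uncased as their base model.
finetunes = list(
    api.list_models(
        filter="base_model:finetune:bert-base-uncased",
        sort="downloads",
        direction=-1,
        limit=100,
    )
)
print(len(finetunes), "fine-tuned models found")
for m in finetunes[:5]:
    print(m.id, m.downloads)
```

Stacking the representation of each returned checkpoint (for instance with the `model_representation` sketch above) then yields the (number of fine-tuned models, number of statistics) matrix whose rows we try to cluster.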
It is also possible to compute any metric of interest. For each dataset, we know how many models were fine-tuned on it. Given a fine-tuned model, we can read the dataset(s) it was trained on from its model-card metadata.
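As a sketch of one such metric, we can count, for each dataset tag, how many of the collected fine-tuned checkpoints declare it. This again relies on model-card metadata (`dataset:...` tags), so the counts are only a lower bound.

```python
from collections import Counter
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(filter="base_model:finetune:bert-base-uncased", limit=500)

# Count how many checkpoints declare each dataset in their model card.
dataset_counts = Counter(
    tag.removeprefix("dataset:")
    for m in models
    for tag in (m.tags or [])
    if tag.startswith("dataset:")
)
print(dataset_counts.most_common(10))
```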
Research questions:
- How does the rank of the parameters of a model evolve as a function of their depth in the network?
For a parameter matrix of a given type (query $W_q$, key $W_k$, value $W_v$, output $W_o$, feedforward1 $W_1$, feedforward2 $W_2$), we can study how its rank evolves as we go from layer $1$ to layer $L$.
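One possible way to probe this, assuming the statistic of interest is the numerical matrix rank (an effective-rank variant could be substituted) and using the module names of the BERT implementation in `transformers`:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

def matrices(layer):
    """The six weight matrices of one encoder layer, keyed by type."""
    return {
        "query": layer.attention.self.query.weight,
        "key": layer.attention.self.key.weight,
        "value": layer.attention.self.value.weight,
        "output": layer.attention.output.dense.weight,
        "ffn1": layer.intermediate.dense.weight,
        "ffn2": layer.output.dense.weight,
    }

# For each matrix type, collect its rank at every depth so that
# rank-vs-layer curves can be plotted.
ranks = {t: [] for t in ("query", "key", "value", "output", "ffn1", "ffn2")}
with torch.no_grad():
    for layer in model.encoder.layer:                    # layer 1 ... L
        for name, W in matrices(layer).items():
            ranks[name].append(torch.linalg.matrix_rank(W).item())

for name, values in ranks.items():
    print(name, values)                                  # one curve per matrix type
```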
This work would not have been possible without Hugging Face.