Standard deviation of the activations

Question

Opened this issue 3 months ago · 3 comments

Hello, I am interested in the standard deviation of the activations and would like to know how it is calculated. I can think of a few possible methods:

1. Calculate it over 100 sequences within a specific layer and report that value in the table.
2. Calculate it over 100 sequences and over the layers with relatively large values (e.g., layers 2-30).
3. Calculate it over all layers.

Could you please specify which of the above situations applies?

Thanks.

Answer 1 · 2024-07-01T17:07:40.000Z

Thanks for your interest in our work. That would be option 1. The table shows the standard deviation of the activation within a fixed layer.

Answer 2 · 2024-07-02T03:05:58.000Z

Thanks a lot.
So we just calculate the standard deviation of 100 values. Take the top-1 activation as an example: it might be the 2533rd dimension of the starting token in the 15th layer. We collect that value from each of the 100 sequences and then compute the standard deviation of those 100 values.

Answer 3 · 2024-07-03T16:20:54.000Z

Yes, that's correct.
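
For readers who want to reproduce this, here is a minimal sketch of the procedure confirmed above: fix one (layer, token, dimension) coordinate, collect its value from each of the 100 sequences, and take the standard deviation of those 100 numbers. The array layout, the random stand-in data, the small indices, and the helper name `activation_std` are all assumptions made for illustration, not the authors' code. Whether the table uses the population (`ddof=0`) or sample (`ddof=1`) estimator is not stated in this thread, so it is left as a parameter.

```python
import numpy as np


def activation_std(values, ddof=0):
    """Standard deviation of one activation coordinate across sequences.

    `values` holds one number per sequence. `ddof=0` gives the population
    estimate and `ddof=1` the sample estimate; the thread does not say
    which one the table uses, so the default here is an assumption.
    """
    values = np.asarray(values, dtype=np.float64)
    return float(np.std(values, ddof=ddof))


# Hypothetical stand-in for real hidden states, with assumed layout
# (num_sequences, num_layers, seq_len, hidden_dim).
rng = np.random.default_rng(0)
num_sequences, num_layers, seq_len, hidden_dim = 100, 4, 8, 16
hidden = rng.normal(size=(num_sequences, num_layers, seq_len, hidden_dim))

# Fix one coordinate. In the thread's example this would be a specific
# layer, the starting token, and a specific dimension; small hypothetical
# indices are used here so the sketch stays light.
layer, token, dim = 2, 0, 5

# One value per sequence: shape (100,).
samples = hidden[:, layer, token, dim]

std = activation_std(samples)
print(f"std over {samples.shape[0]} sequences: {std:.4f}")
```

With real data, `hidden` would be replaced by the hidden states captured from the model for each of the 100 input sequences; everything after that line stays the same.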