openai/automated-interpretability

Question about activation calculation

Daftstone opened this issue · 3 comments

I would like to know how neuron activations are calculated and how they are mapped to each input token. Alternatively, if you could point me to related work on calculating neuron activations, I would be very grateful.

Yes, I have the same question regarding the calculation of token-level activations.
It is not clear in either the paper or the code.
If anyone could give some hints, I would also be very grateful.

Dear authors,

I found that this section provides the definition of neuron-token connection weights. First, I want to confirm whether the token-level neuron activation is extracted based on this section. I am confused because this quantity does not seem to take context information into account. Specifically, according to the expression `h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :]`, the output weight of a neuron (l, n) with respect to a token t appears to be independent of the other tokens in the input.

I would greatly appreciate it if someone could address my confusion and provide clarification on this matter.

Best,
Xuansheng
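
For reference, the quoted expression can be evaluated directly from the model weights. The sketch below is a minimal, weight-only example assuming Hugging Face GPT-2 parameter names (`transformer.h[l].mlp.c_proj.weight`, `transformer.ln_f.weight`, `transformer.wte.weight`); the layer/neuron/token choices are arbitrary and only illustrate the indexing, and the mapping to the repo's own notation (`h{l}.mlp.c_proj.w`, `ln_f.g`, `wte`) is an assumption.

```python
# Weight-based neuron-to-token connection, per the expression quoted above.
# Assumes Hugging Face GPT-2 naming; layer/neuron/token below are arbitrary examples.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer, neuron = 5, 131                   # example neuron (l, n)
token_id = tokenizer.encode(" cat")[0]   # example token t

with torch.no_grad():
    # c_proj.weight has shape [d_mlp, d_model] in HF's Conv1D, so row `neuron`
    # is that neuron's output direction in the residual stream.
    w_out = model.transformer.h[layer].mlp.c_proj.weight[neuron]  # [d_model]
    ln_g = model.transformer.ln_f.weight                          # [d_model]
    wte = model.transformer.wte.weight                            # [vocab, d_model]

    # Equivalent to w_out @ diag(ln_g) @ wte[t, :]. Note that only weights are
    # involved, so the score for token t is the same regardless of context.
    score = (w_out * ln_g) @ wte[token_id]

print(score.item())
```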

yes, that's right - it doesn't take context information into account. it would probably be better to use something activation-based instead of weight-based
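
For the activation-based view, one way to get a context-dependent, per-token value for a neuron is to run the model on some text and record the MLP hidden activations with a forward hook. The sketch below is a minimal example assuming Hugging Face's GPT-2 implementation, where the input to `mlp.c_proj` is the post-GELU activation of the MLP hidden layer; whether this matches exactly how the released activation datasets were produced would need to be checked against the repo.

```python
# Record per-token MLP neuron activations with a forward hook (context-dependent),
# assuming Hugging Face GPT-2 module names; layer/neuron are arbitrary examples.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer, neuron = 5, 131
captured = {}

def hook(module, inputs, output):
    # inputs[0] is the post-GELU MLP activation, shape [batch, seq, d_mlp];
    # c_proj then projects it back to d_model.
    captured["acts"] = inputs[0].detach()

# Hooking c_proj's input exposes every MLP neuron's activation at every position.
handle = model.transformer.h[layer].mlp.c_proj.register_forward_hook(hook)

text = "The quick brown fox"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**enc)
handle.remove()

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
acts = captured["acts"][0, :, neuron]   # one activation per input token
for tok, a in zip(tokens, acts.tolist()):
    print(f"{tok!r}: {a:.4f}")
```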