I think overall this was a good way to get some hands-on experience working with SAEs and applying some MI concepts to a problem.
MATS 6.0 Application Google Doc summarizing my findings. I worked on finding interesting SAE features inside a 1L transformer model and reverse-engineering them.
Some of the interesting features I looked at were:
- A `'t` feature that is active on the `'t` token in the presence of words like don't, doesn't, etc.
- A context-dependent feature that activates on the tokens `or`, `,and`, and `and,` in texts that are related to phones, emails, messages, etc.
- A close brackets feature? This feature activates on tokens immediately following an opening bracket `(` and boosts the logits of closing brackets `)`.
- A relatively non-sparse feature that seems to activate on tokens that follow an `of` token.
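
A claim like "boosts the logits of closing brackets" is commonly checked by projecting a feature's decoder direction through the unembedding matrix. The sketch below is only an illustration of that idea, not necessarily how the findings above were obtained. The names `decoder_dir`, `W_U`, and `vocab` and all numbers are made up, and the sketch assumes the feature writes directly to the residual stream and ignores any final LayerNorm.

```python
def direct_logit_effect(decoder_dir, W_U):
    """Per-token logit change from adding `decoder_dir` to the residual stream.

    decoder_dir: list of d_model floats (hypothetical SAE decoder direction).
    W_U: d_model x vocab nested list (hypothetical unembedding matrix).
    Assumes no LayerNorm between the feature and the unembedding.
    """
    vocab_size = len(W_U[0])
    return [
        sum(decoder_dir[i] * W_U[i][v] for i in range(len(decoder_dir)))
        for v in range(vocab_size)
    ]


def top_boosted_tokens(decoder_dir, W_U, vocab, k=3):
    """Return the k (token, logit effect) pairs with the largest boost."""
    effects = direct_logit_effect(decoder_dir, W_U)
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values, invented purely for illustration (d_model = 2, vocab = 4).
vocab = ["(", ")", "of", "and"]
W_U = [
    [0.0, 2.0, 0.1, 0.0],
    [1.0, 0.5, 0.0, -1.0],
]
decoder_dir = [1.0, 0.0]

top = top_boosted_tokens(decoder_dir, W_U, vocab, k=2)
print(top)
```

With a real model, `W_U` and the decoder direction would come from the trained transformer and SAE. A feature like the close brackets one would be expected to rank `)` near the top of this list.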