This repository contains the research and analysis conducted for the thesis on the mechanistic interpretability of vision transformers. It explores how vision transformers process image data, focusing on their internal mechanisms, polysemantic neurons, and the application of sparse autoencoders to disentangle these neurons.
- The thesis identifies three distinct phases within the layers of supervised pretrained vision transformers, highlighting consistent behavior within each phase.
- In each phase, key features play essential roles and substantially shape model behavior.
- A comparison with self-supervised models indicates clear differences in model behavior across layers.
- The study employs sparse autoencoders to mitigate polysemanticity, revealing more interpretable features and providing insights into the model's internal mechanisms (a minimal sketch of this approach follows the list).
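
For orientation, the sketch below illustrates the general idea of a sparse autoencoder trained on transformer activations: an overcomplete linear encoder/decoder with an L1 penalty on the latent code, so that each activation vector is explained by a small number of dictionary features. The layer sizes, sparsity coefficient, and variable names are illustrative assumptions, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on its latent code.

    Trained to reconstruct ViT activation vectors; individual latent units tend
    to be more monosemantic than the raw neurons they are trained on.
    """

    def __init__(self, d_model: int = 768, d_hidden: int = 768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative sparse code over an overcomplete feature dictionary.
        codes = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(codes)
        return reconstruction, codes


def sae_loss(reconstruction, codes, activations, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the dictionary faithful to the activations;
    # the L1 term pushes each activation to be explained by few features.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * codes.abs().mean()
    return mse + sparsity


if __name__ == "__main__":
    # Toy usage with random data standing in for ViT activations.
    sae = SparseAutoencoder()
    batch = torch.randn(32, 768)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
    recon, codes = sae(batch)
    loss = sae_loss(recon, codes, batch)
    loss.backward()
    optimizer.step()
```

In this kind of setup, the hidden dimension is typically several times larger than the model dimension so that features can separate into distinct directions; the sparsity coefficient trades reconstruction fidelity against the interpretability of individual latent units.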