Decomposing Vision Transformers Using Dictionary Learning

Overview

This repository contains the research and analysis conducted for a thesis on the mechanistic interpretability of vision transformers. It explores how vision transformers process image data, focusing on their internal mechanisms, polysemantic neurons, and the application of sparse autoencoders to disentangle those neurons into interpretable features.

Key Findings

  • The thesis identifies three distinct phases within the layers of supervised pretrained vision transformers, with consistent behavior inside each phase.
  • Each phase is driven by a distinct set of key features that strongly influence model behavior.
  • A comparison with self-supervised models reveals clear differences in layer-wise behavior between the two training regimes.
  • The study employs sparse autoencoders to mitigate polysemanticity, recovering more interpretable features and providing insight into the model's internal mechanisms.
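To illustrate the last point, the sketch below shows the core of a sparse autoencoder applied to transformer activations: an overcomplete ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. All dimensions, the expansion factor, and the L1 coefficient are illustrative assumptions, not the thesis's actual hyperparameters or code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64             # width of the ViT residual stream (assumed)
d_hidden = 4 * d_model   # overcomplete dictionary, 4x expansion (assumed)

# Randomly initialized encoder/decoder weights (training loop omitted)
W_enc = rng.normal(0.0, 0.02, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0.0, 0.02, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; combined with the
    # L1 penalty below, this pushes most features toward zero (sparsity)
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    # Linear reconstruction: each row of W_dec is one dictionary feature
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)       # reconstruction error
    sparsity = l1_coeff * np.abs(f).mean()  # L1 penalty on activations
    return recon + sparsity, f, x_hat

x = rng.normal(size=(8, d_model))  # a batch of activation vectors
loss, features, x_hat = sae_loss(x)
print(features.shape, x_hat.shape)
```

Minimizing this loss over a large set of activations yields a dictionary whose features tend to activate on single, human-interpretable concepts rather than mixing several, which is how the SAE helps disentangle polysemantic neurons.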