llm-interpretability

There are 6 repositories under the llm-interpretability topic.

  • PaulPauls/llama3_interpretability_sae

    A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible (see the minimal SAE sketch after this list).

    Language: Python
  • basics-lab/spectral-explain

    Fast XAI with interactions at large scale. SPEX can help you understand the output of your LLM, even if you have a long context!

    Language: Jupyter Notebook
  • BeekeepingAI/hexray

    🔬 HexRay: An Open-Source Neuroscope for AI — Tracing Tokens, Neurons, and Decisions for Frontier AI Research, Safety, and Security

    Language: Python
  • Luisibear98/intervention-jailbreak

    This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns, particularly in deeper layers, we identify distinct differences between compliant and non-compliant responses and use them to uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify the model's internal activations (see the direction-ablation sketch after this list).

    Language: Python
  • helenmand/IMDB-movie-reviews-sentiment-explainer

    Fine-tuned DistilBERT for binary sentiment analysis on IMDB movie reviews, with token-level interpretability using LayerIntegratedGradients (see the attribution sketch after this list).

    Language: Jupyter Notebook
  • peppinob-ol/attribution-graph-probing

    Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, and cluster supernodes (see the clustering sketch after this list).

    Language: Python
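
The SAE approach named by llama3_interpretability_sae can be illustrated with a minimal sketch: train a wide, sparsity-penalized autoencoder on activations captured from the model, so that each learned feature corresponds to a direction in activation space. Everything below (layer sizes, L1 coefficient, the random stand-in for captured activations) is an illustrative assumption, not the repository's actual pipeline.

```python
# Minimal sparse-autoencoder (SAE) sketch in PyTorch.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> features
        self.decoder = nn.Linear(d_hidden, d_model)   # features -> reconstruction

    def forward(self, x: torch.Tensor):
        features = F.relu(self.encoder(x))            # non-negative, sparse features
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    mse = F.mse_loss(reconstruction, x)
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

if __name__ == "__main__":
    d_model, d_hidden = 2048, 16384                   # assumed sizes
    sae = SparseAutoencoder(d_model, d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    activations = torch.randn(64, d_model)            # stand-in for captured LLM activations
    recon, feats = sae(activations)
    loss = sae_loss(activations, recon, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```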
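The intervention-jailbreak description outlines a common recipe: estimate a "jailbreak direction" as the difference between mean activations of non-compliant and compliant responses, then remove that component from hidden states at inference time. The sketch below shows the general idea under those assumptions; the layer choice, hook point, and tensor shapes are hypothetical and not taken from the repository.

```python
# Hedged sketch of a direction-ablation intervention on LLM activations.
import torch

def jailbreak_direction(non_compliant_acts: torch.Tensor,
                        compliant_acts: torch.Tensor) -> torch.Tensor:
    # Both inputs: (n_samples, d_model) activations from a deep layer.
    direction = non_compliant_acts.mean(0) - compliant_acts.mean(0)
    return direction / direction.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component of each hidden state along the unit direction.
    coeff = hidden @ direction                         # (batch, seq)
    return hidden - coeff.unsqueeze(-1) * direction

if __name__ == "__main__":
    d_model = 2048
    non_compliant = torch.randn(128, d_model) + 0.5    # stand-in activations
    compliant = torch.randn(128, d_model)
    direction = jailbreak_direction(non_compliant, compliant)

    hidden = torch.randn(4, 16, d_model)               # (batch, seq, d_model)
    patched = ablate_direction(hidden, direction)
    print((patched @ direction).abs().max())           # ~0: component removed

    # In practice this would run inside a forward hook on a (hypothetical)
    # transformer block, e.g. model.model.layers[20], so generation is steered
    # away from the jailbreak direction at that layer.
```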
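Token-level attribution with LayerIntegratedGradients, as in the IMDB sentiment explainer, typically follows this Captum pattern: wrap the classifier's logits in a forward function, attribute a chosen class logit to the embedding layer, and sum over the embedding dimension to get one score per token. The checkpoint name, the [PAD] baseline, and the target index below are assumptions for illustration.

```python
# Hedged sketch of token-level attribution with Captum's LayerIntegratedGradients
# on a DistilBERT sentiment classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def forward_logits(input_ids, attention_mask):
    # Captum calls this during attribution.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

text = "A surprisingly moving film with a terrific lead performance."
enc = tokenizer(text, return_tensors="pt")
baseline_ids = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

# Attribute the positive-class logit (index 1 here, by assumption) to the
# embedding layer, yielding one attribution vector per input token.
lig = LayerIntegratedGradients(forward_logits, model.distilbert.embeddings)
attributions, delta = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(enc["attention_mask"],),
    target=1,
    return_convergence_delta=True,
)

token_scores = attributions.sum(dim=-1).squeeze(0)     # collapse embedding dim
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, score in zip(tokens, token_scores.tolist()):
    print(f"{tok:>12s}  {score:+.3f}")
```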
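For attribution-graph-probing, the "profile feature activations, cluster supernodes" step can be sketched generically: build a matrix of feature activations over the probe prompts and group features whose profiles move together. The data below is synthetic and the clustering choice (agglomerative with cosine distance) is an assumption, not the project's actual method.

```python
# Hedged sketch: cluster feature activation profiles into "supernodes".
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# activation_profiles[i, j] = activation of feature i on probe prompt j
rng = np.random.default_rng(0)
activation_profiles = rng.random((200, 12))            # stand-in for measured profiles

clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.5,
    metric="cosine",
    linkage="average",
)
labels = clusterer.fit_predict(activation_profiles)

# Each label groups features whose activations rise and fall together across
# the probes; such a group can be treated as one supernode in the graph.
for supernode in np.unique(labels)[:5]:
    members = np.flatnonzero(labels == supernode)
    print(f"supernode {supernode}: {len(members)} features")
```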