llm-interpretability
There are 6 repositories under the llm-interpretability topic.
PaulPauls/llama3_interpretability_sae
A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
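The core building block of an SAE pipeline like this is a small, overcomplete autoencoder trained to reconstruct LLM activations under a sparsity penalty. A minimal PyTorch sketch of that idea follows; the layer sizes and loss coefficient are illustrative placeholders, not the repo's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through an overcomplete,
    ReLU-gated hidden layer (illustrative; not the repo's code)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

# Training objective: reconstruction error plus an L1 sparsity penalty.
sae = SparseAutoencoder(d_model=2048, d_hidden=16384)  # sizes are assumptions
x = torch.randn(8, 2048)                               # stand-in for LLM activations
x_hat, f = sae(x)
loss = torch.mean((x_hat - x) ** 2) + 1e-3 * f.abs().mean()
loss.backward()
```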
basics-lab/spectral-explain
Fast XAI with interactions at large scale. SPEX can help you understand the output of your LLM, even if you have a long context!
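"XAI with interactions" means attributing model behavior not just to single inputs but to combinations of them. As a toy illustration of a second-order interaction effect, estimated by masking two tokens individually and jointly, consider the sketch below; this shows the concept only and is not SPEX's actual algorithm or API.

```python
from typing import Callable, Sequence

def pairwise_interaction(score: Callable[[Sequence[str]], float],
                         tokens: list[str], i: int, j: int,
                         mask_token: str = "[MASK]") -> float:
    """Pairwise interaction between tokens i and j via a second difference:
    f(both) - f(without i) - f(without j) + f(without both)."""
    def masked(drop: set[int]) -> list[str]:
        return [mask_token if k in drop else t for k, t in enumerate(tokens)]
    return (score(masked(set()))      # both tokens present
            - score(masked({i}))     # i masked out
            - score(masked({j}))     # j masked out
            + score(masked({i, j}))) # both masked out

# Toy scorer: counts a target word; any LLM log-prob would work here.
score = lambda toks: float(toks.count("good"))
print(pairwise_interaction(score, ["the", "movie", "was", "good"], 1, 3))
```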
BeekeepingAI/hexray
🔬 HexRay: An Open-Source Neuroscope for AI — Tracing Tokens, Neurons, and Decisions for Frontier AI Research, Safety, and Security
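The basic mechanism behind neuron-level tracing is recording intermediate activations as a model runs. A generic PyTorch forward-hook sketch of that mechanism is below; it is not HexRay's implementation.

```python
import torch
import torch.nn as nn

# Record per-layer activations with forward hooks (generic sketch).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
traces: dict[str, torch.Tensor] = {}

def make_hook(name: str):
    def hook(module, inputs, output):
        traces[name] = output.detach()  # stash activations for inspection
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container
        module.register_forward_hook(make_hook(name))

model(torch.randn(1, 16))
for name, act in traces.items():
    print(name, tuple(act.shape), f"max={act.max().item():.3f}")
```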
Luisibear98/intervention-jailbreak
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns, particularly in deeper layers, we identify distinct differences between compliant and non-compliant responses and uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify these activations to steer the model away from jailbroken behavior.
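A common recipe for this kind of activation-level intervention is a difference-of-means direction followed by projection ablation. The sketch below shows that recipe under assumed shapes and synthetic data; it is not this repo's exact code.

```python
import torch

def direction_from_activations(compliant: torch.Tensor,
                               noncompliant: torch.Tensor) -> torch.Tensor:
    """Unit 'jailbreak direction' from (n_samples, d_model) activations."""
    d = noncompliant.mean(0) - compliant.mean(0)
    return d / d.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove each activation's component along the direction."""
    return acts - (acts @ direction).unsqueeze(-1) * direction

compliant = torch.randn(100, 512)           # placeholder activations
noncompliant = torch.randn(100, 512) + 0.5  # shifted to mimic a behavioral split
d = direction_from_activations(compliant, noncompliant)
steered = ablate_direction(noncompliant, d)
print((steered @ d).abs().max().item())  # ~0: the direction is removed
```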
helenmand/IMDB-movie-reviews-sentiment-explainer
Fine-tuned DistilBERT for binary sentiment analysis on IMDB movie reviews with token-level interpretability using LayerIntegratedGradients
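LayerIntegratedGradients is Captum's layer-wise variant of integrated gradients; applied to the embedding layer it yields one attribution score per token. A self-contained sketch follows, using the stock pretrained DistilBERT checkpoint rather than the repo's fine-tuned one.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from captum.attr import LayerIntegratedGradients

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.eval()

def forward(input_ids):
    return model(input_ids).logits

inputs = tokenizer("A surprisingly touching film.", return_tensors="pt")
input_ids = inputs["input_ids"]
baseline = torch.full_like(input_ids, tokenizer.pad_token_id)  # all-[PAD] baseline

# Attribute the positive-class logit (target=1) to the embedding layer.
lig = LayerIntegratedGradients(forward, model.distilbert.embeddings)
attributions = lig.attribute(input_ids, baselines=baseline, target=1)
scores = attributions.sum(dim=-1).squeeze(0)  # one score per token
for tok, s in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), scores):
    print(f"{tok:>12s} {s.item():+.4f}")
```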
peppinob-ol/attribution-graph-probing
Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, cluster supernodes.
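The final step named in that pipeline, grouping features into supernodes by how similarly they activate across concept probes, can be illustrated with off-the-shelf clustering. The sketch below uses random stand-in activation profiles and scikit-learn; it illustrates the clustering step only, not the repo's actual pipeline.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
n_features, n_probes = 40, 12
profiles = rng.random((n_features, n_probes))  # stand-in activation profiles

# Normalize rows so clustering compares activation patterns, not magnitudes.
profiles /= np.linalg.norm(profiles, axis=1, keepdims=True)

clusterer = AgglomerativeClustering(n_clusters=5, metric="cosine",
                                    linkage="average")
supernodes = clusterer.fit_predict(profiles)
for k in range(5):
    members = np.flatnonzero(supernodes == k)
    print(f"supernode {k}: features {members.tolist()}")
```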