llm-interpretability
There are 6 repositories under the llm-interpretability topic.
PaulPauls/llama3_interpretability_sae
A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
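The core building block of an SAE pipeline like this is a small, overcomplete autoencoder trained to reconstruct LLM activations under a sparsity penalty. A minimal PyTorch sketch of that idea follows; the layer sizes and loss coefficient are illustrative placeholders, not the repo's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through an overcomplete,
    ReLU-gated hidden layer (illustrative; not the repo's code)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

# Training objective: reconstruction error plus an L1 sparsity penalty.
sae = SparseAutoencoder(d_model=2048, d_hidden=16384)  # sizes are assumptions
x = torch.randn(8, 2048)                               # stand-in for LLM activations
x_hat, f = sae(x)
loss = torch.mean((x_hat - x) ** 2) + 1e-3 * f.abs().mean()
loss.backward()
```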
basics-lab/spectral-explain
Fast XAI with interactions at large scale. SPEX can help you understand the output of your LLM, even if you have a long context!
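"XAI with interactions" means attributing model behavior not just to single inputs but to combinations of them. As a toy illustration of a second-order interaction effect, estimated by masking two tokens individually and jointly, consider the sketch below; this shows the concept only and is not SPEX's actual algorithm or API.

```python
from typing import Callable, Sequence

def pairwise_interaction(score: Callable[[Sequence[str]], float],
                         tokens: list[str], i: int, j: int,
                         mask_token: str = "[MASK]") -> float:
    """Pairwise interaction between tokens i and j via a second difference:
    f(both) - f(without i) - f(without j) + f(without both)."""
    def masked(drop: set[int]) -> list[str]:
        return [mask_token if k in drop else t for k, t in enumerate(tokens)]
    return (score(masked(set()))      # both tokens present
            - score(masked({i}))     # i masked out
            - score(masked({j}))     # j masked out
            + score(masked({i, j}))) # both masked out

# Toy scorer: counts a target word; any LLM log-prob would work here.
score = lambda toks: float(toks.count("good"))
print(pairwise_interaction(score, ["the", "movie", "was", "good"], 1, 3))
```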
BeekeepingAI/hexray
🔬 HexRay: An Open-Source Neuroscope for AI — Tracing Tokens, Neurons, and Decisions for Frontier AI Research, Safety, and Security
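The basic mechanism behind neuron-level tracing is recording intermediate activations as a model runs. A generic PyTorch forward-hook sketch of that mechanism is below; it is not HexRay's implementation.

```python
import torch
import torch.nn as nn

# Record per-layer activations with forward hooks (generic sketch).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
traces: dict[str, torch.Tensor] = {}

def make_hook(name: str):
    def hook(module, inputs, output):
        traces[name] = output.detach()  # stash activations for inspection
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container
        module.register_forward_hook(make_hook(name))

model(torch.randn(1, 16))
for name, act in traces.items():
    print(name, tuple(act.shape), f"max={act.max().item():.3f}")
```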
Luisibear98/intervention-jailbreak
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns, particularly in deeper layers, we identify distinct differences between compliant and non-compliant responses and uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify these activations to steer the model away from jailbroken behavior.
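A common recipe for this kind of activation-level intervention is a difference-of-means direction followed by projection ablation. The sketch below shows that recipe under assumed shapes and synthetic data; it is not this repo's exact code.

```python
import torch

def direction_from_activations(compliant: torch.Tensor,
                               noncompliant: torch.Tensor) -> torch.Tensor:
    """Unit 'jailbreak direction' from (n_samples, d_model) activations."""
    d = noncompliant.mean(0) - compliant.mean(0)
    return d / d.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove each activation's component along the direction."""
    return acts - (acts @ direction).unsqueeze(-1) * direction

compliant = torch.randn(100, 512)           # placeholder activations
noncompliant = torch.randn(100, 512) + 0.5  # shifted to mimic a behavioral split
d = direction_from_activations(compliant, noncompliant)
steered = ablate_direction(noncompliant, d)
print((steered @ d).abs().max().item())  # ~0: the direction is removed
```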
helenmand/IMDB-movie-reviews-sentiment-explainer
Fine-tuned DistilBERT for binary sentiment analysis on IMDB movie reviews with token-level interpretability using LayerIntegratedGradients
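LayerIntegratedGradients is Captum's layer-wise variant of integrated gradients; applied to the embedding layer it yields one attribution score per token. A self-contained sketch follows, using the stock pretrained DistilBERT checkpoint rather than the repo's fine-tuned one.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from captum.attr import LayerIntegratedGradients

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.eval()

def forward(input_ids):
    return model(input_ids).logits

inputs = tokenizer("A surprisingly touching film.", return_tensors="pt")
input_ids = inputs["input_ids"]
baseline = torch.full_like(input_ids, tokenizer.pad_token_id)  # all-[PAD] baseline

# Attribute the positive-class logit (target=1) to the embedding layer.
lig = LayerIntegratedGradients(forward, model.distilbert.embeddings)
attributions = lig.attribute(input_ids, baselines=baseline, target=1)
scores = attributions.sum(dim=-1).squeeze(0)  # one score per token
for tok, s in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), scores):
    print(f"{tok:>12s} {s.item():+.4f}")
```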
peppinob-ol/attribution-graph-probing
Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, cluster supernodes.
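The final step named in that pipeline, grouping features into supernodes by how similarly they activate across concept probes, can be illustrated with off-the-shelf clustering. The sketch below uses random stand-in activation profiles and scikit-learn; it illustrates the clustering step only, not the repo's actual pipeline.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
n_features, n_probes = 40, 12
profiles = rng.random((n_features, n_probes))  # stand-in activation profiles

# Normalize rows so clustering compares activation patterns, not magnitudes.
profiles /= np.linalg.norm(profiles, axis=1, keepdims=True)

clusterer = AgglomerativeClustering(n_clusters=5, metric="cosine",
                                    linkage="average")
supernodes = clusterer.fit_predict(profiles)
for k in range(5):
    members = np.flatnonzero(supernodes == k)
    print(f"supernode {k}: features {members.tolist()}")
```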