Explainability-for-Large-Language-Models: A Survey

📖 Papers and resources related to our survey ("Explainability for Large Language Models: A Survey") are organized by the structure of the paper.


We categorize LLM explainability according to two major training paradigms. The explainability techniques associated with each paradigm are summarized as follows:

Paper Structure

Training Paradigms of LLMs

Traditional Fine-Tuning Paradigm

A language model is first pre-trained on a large corpus of unlabeled text data, and then fine-tuned on a set of labeled data from a specific downstream domain.

  1. Datasets

    SST-2, MNLI, QQP, etc.

  2. Models


Prompting Paradigm

The prompting paradigm uses prompts, such as natural language sentences with blanks for the model to fill in, to enable zero-shot or few-shot learning without additional training data. Models under this paradigm fall into two types, based on their development stage: base models and assistant models. A base model is produced by unsupervised pre-training from random initialization. The base model can then be fine-tuned through instruction tuning and RLHF to produce an assistant model.
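As a rough illustration of what "a prompt with a blank for the model to fill in" can look like, the sketch below assembles zero-shot and few-shot prompts as plain strings. The function name `build_few_shot_prompt`, the `Review:`/`Sentiment:` template, and the example reviews are all hypothetical choices made for this sketch; they are not taken from the survey or from any particular model's recommended format.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    demonstrations: List[Tuple[str, str]],
    query: str,
    instruction: str = "",
) -> str:
    """Assemble a prompt whose final label slot is left blank for the model."""
    parts = []
    if instruction:
        parts.append(instruction)
    # Each demonstration is a fully worked (input, label) pair.
    for text, label in demonstrations:
        parts.append(f"Review: {text}\nSentiment: {label}")
    # The last block ends with an empty slot that the model is expected to fill.
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)


# Zero-shot: no demonstrations, only the query with its blank slot.
zero_shot = build_few_shot_prompt([], "A dull, lifeless film.")

# Few-shot: two hypothetical demonstrations precede the query.
few_shot = build_few_shot_prompt(
    [
        ("A joy from start to finish.", "positive"),
        ("I want those two hours back.", "negative"),
    ],
    "A dull, lifeless film.",
    instruction="Classify the sentiment of each review.",
)
```

In both cases no model weights are updated; the only difference between zero-shot and few-shot use is how many worked examples appear in the prompt text.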

Prompting training procedure

  1. Base Model

    GPT-3, OPT, LLaMA-1, LLaMA-2, Falcon, etc.

  2. Assistant Model

    GPT-3.5, GPT-4, Claude, LLaMA-2-Chat, Alpaca, Vicuna, etc.

Explanation for Traditional Fine-Tuning Paradigm

Local Explanation

Local explanation focuses on understanding how a language model makes a prediction for a specific input instance.
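One simple way to explain a single prediction is leave-one-out perturbation: remove each token in turn and record how much the model's score changes. The sketch below is a minimal illustration under stated assumptions. The scoring function `toy_score` and its lexicon `TOY_WEIGHTS` are hypothetical stand-ins for a fine-tuned classifier, chosen so the example runs without any model.

```python
from typing import Callable, Dict, List

# Hypothetical lexicon standing in for a fine-tuned sentiment classifier.
TOY_WEIGHTS: Dict[str, float] = {"great": 2.0, "boring": -2.5, "not": -1.0}


def toy_score(tokens: List[str]) -> float:
    """Toy model: sum of per-token weights (unknown tokens weigh 0)."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)


def leave_one_out(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[float]:
    """Attribution of token i = score(full input) - score(input without token i)."""
    full = score_fn(tokens)
    return [
        full - score_fn(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


tokens = ["the", "plot", "was", "great"]
attributions = leave_one_out(tokens, toy_score)
for tok, attr in zip(tokens, attributions):
    print(f"{tok:>6}: {attr:+.2f}")
```

With a real classifier, `score_fn` would return the predicted probability or logit of the target class. Deleting a token can also push the input off the training distribution, which is one reason perturbation scores should be read with care.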

Local Explanation for LLMs

Feature Attribution-Based Explanation

Perturbation-Based Explanation
Attention-Based Explanation

Example-Based Explanations

Adversarial Example
Natural Language Explanation

Global Explanation

Global explanation aims to provide a broad understanding of how an LLM works at the level of model components, such as neurons, hidden layers, and larger modules.
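A common component-level technique is a probing classifier: freeze the model, take the activations of one layer, and train a small classifier to predict a property of interest from them. The sketch below is illustrative only. The "hidden states" are synthetic vectors generated by the hypothetical helper `make_hidden_states`, in which dimension 2 is deliberately constructed to encode the label, and the probe is a plain logistic regression trained with stochastic gradient descent.

```python
import math
import random
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_hidden_states(
    n: int, rng: random.Random
) -> Tuple[List[List[float]], List[int]]:
    """Synthetic stand-in for layer activations: dimension 2 encodes the label."""
    feats, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        h = [rng.uniform(-1.0, 1.0) for _ in range(4)]
        h[2] = (1.0 if y == 1 else -1.0) + rng.gauss(0.0, 0.3)
        feats.append(h)
        labels.append(y)
    return feats, labels


def train_probe(
    feats: List[List[float]], labels: List[int], epochs: int = 20, lr: float = 0.1
) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe with plain SGD on the frozen features."""
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def accuracy(
    w: List[float], b: float, feats: List[List[float]], labels: List[int]
) -> float:
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
        for x, y in zip(feats, labels)
    )
    return correct / len(labels)


rng = random.Random(0)
train_x, train_y = make_hidden_states(200, rng)
test_x, test_y = make_hidden_states(100, rng)
w, b = train_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

High probe accuracy suggests that the property is linearly decodable from the layer. It does not by itself show that the model relies on that information, so probing results are usually compared against a baseline or control task.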

Probing-Based Explanation

Classifier-Based Probing
Neuron Activation Explanation

Concept-Based Explanation

Making Use of Explanations

Debugging Models

Improving Models

Explanation for Prompting Paradigm

Under the prompting paradigm, LLMs show impressive abilities such as few-shot learning and chain-of-thought reasoning, as well as phenomena such as hallucination, that are absent in the conventional fine-tuning paradigm. Given these emergent properties, explainability research is expected to investigate the underlying mechanisms. Explanations for the prompting paradigm fall into two categories that follow the model development stages: base model explanation and assistant model explanation.

Base Model Explanation

Explanations Benefit Model Learning

Explaining In-context Learning

Explaining CoT Prompting

Assistant Model Explanation

Explaining the Role of Fine-tuning

Explaining Hallucination and Uncertainty

Making Use of Explanations

Improving LLMs

Downstream Applications

Explanation Evaluation

Explanations can be evaluated along multiple dimensions, such as plausibility, faithfulness, and stability. Within each dimension, different metrics often fail to agree with one another, and constructing standard metrics remains an open challenge. In this part, we focus on two dimensions: plausibility and faithfulness. Quantitative properties and metrics, which are usually more reliable than qualitative ones, are presented in detail.
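To make "quantitative faithfulness metric" concrete, the sketch below computes two erasure-style scores that are commonly used for this purpose: comprehensiveness (how much the prediction drops when the top-attributed tokens are removed) and sufficiency (how much it drops when only those tokens are kept). This is a minimal sketch under stated assumptions. The probability function `toy_prob`, its lexicon, and the attribution values are hypothetical placeholders for a real classifier and a real explanation method.

```python
import math
from typing import Callable, Dict, List, Set

# Hypothetical lexicon standing in for a real classifier.
TOY_WEIGHTS: Dict[str, float] = {"great": 2.0, "boring": -2.5, "not": -1.0}


def toy_prob(tokens: List[str]) -> float:
    """Toy positive-class probability: sigmoid of summed token weights."""
    z = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def top_k_indices(attributions: List[float], k: int) -> Set[int]:
    """Indices of the k tokens with the highest attribution scores."""
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return set(order[:k])


def comprehensiveness(
    tokens: List[str],
    attributions: List[float],
    prob_fn: Callable[[List[str]], float],
    k: int,
) -> float:
    """Probability drop after REMOVING the top-k tokens (higher = more faithful)."""
    top = top_k_indices(attributions, k)
    reduced = [t for i, t in enumerate(tokens) if i not in top]
    return prob_fn(tokens) - prob_fn(reduced)


def sufficiency(
    tokens: List[str],
    attributions: List[float],
    prob_fn: Callable[[List[str]], float],
    k: int,
) -> float:
    """Probability drop after KEEPING only the top-k tokens (lower = more faithful)."""
    top = top_k_indices(attributions, k)
    kept = [t for i, t in enumerate(tokens) if i in top]
    return prob_fn(tokens) - prob_fn(kept)


tokens = ["the", "plot", "was", "great"]
attributions = [0.0, 0.0, 0.0, 2.0]  # hypothetical explanation to be evaluated
comp = comprehensiveness(tokens, attributions, toy_prob, k=1)
suff = sufficiency(tokens, attributions, toy_prob, k=1)
print(f"comprehensiveness={comp:.3f}, sufficiency={suff:.3f}")
```

Here the explanation singles out the one token the toy model actually uses, so removing it changes the prediction substantially while keeping it alone changes nothing. An explanation that highlighted an irrelevant token would score near zero on comprehensiveness.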

Explanation Evaluations in Traditional Fine-tuning Paradigms

Evaluating Plausibility

Evaluating Faithfulness

Evaluation of Explanations in Prompting Paradigms

Evaluating Plausibility

Evaluating Faithfulness

