LLM-Factuality-Survey

The repository for the survey paper "Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity"

Cunxiang Wang1,7*, Xiaoze Liu2*, Yuanhao Yue3*, Qipeng Guo4, Xiangkun Hu4, Xiangru Tang5, Tianhang Zhang6, Cheng Jiayang7, Yunzhi Yao8, Wenyang Gao1,8, Xuming Hu9, Zehan Qi9, Yidong Wang1, Linyi Yang1, Jindong Wang10, Xing Xie10, Zheng Zhang4,11 and Yue Zhang1.

1. School of Engineering, Westlake University; 2. Purdue University; 3. Fudan University; 4. Amazon AWS AI Lab; 5. Yale University; 6. Shanghai Jiao Tong University; 7. HKUST; 8. Zhejiang University; 9. Tsinghua University; 10. Microsoft Research; 11. NYU Shanghai;
(*: Equal Contribution; Correspondence to: Yue Zhang)

NOTE: Real-time updates to the arXiv paper may not be feasible, so please consult this repository for the most recent developments and modifications. We greatly appreciate and welcome pull requests or issues to enhance the quality of this survey. All contributions will be listed in the acknowledgements section.

Paper List

Analysis of Factuality

Knowledge Storage

  1. Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
  2. Locating and Editing Factual Associations in GPT. Meng et al. 2022. [Paper]
  3. Transformer Feed-Forward Layers Are Key-Value Memories. Geva et al. 2021. [Paper]
  4. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Geva et al. 2022. [Paper]
  5. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. Geva et al. 2023. [Paper]
  6. Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons. Chen et al. 2023. [Paper]
  7. A rigorous study of integrated gradients method and extensions to internal neuron attributions. Lundstrom et al. 2022. [Paper]

Knowledge Awareness

  1. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. Gou et al. 2023. [Paper]
  2. Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. Ren et al. 2023. [Paper]
  3. Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
  4. A Survey on In-context Learning. Dong et al. 2023. [Paper]
  5. Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
  6. The Internal State of an LLM Knows When It's Lying. Azaria et al. 2023. [Paper]

Parametric Knowledge vs Retrieved Knowledge

  1. Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
  2. Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
  3. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Izacard et al. 2021. [Paper]
  4. Large language models struggle to learn long-tail knowledge. Kandpal et al. 2023. [Paper]
  5. Head-to-Tail: How Knowledgeable are Large Language Models (LLM)? AKA Will LLMs Replace Knowledge Graphs?. Sun et al. 2023. [Paper]

Contextual Influence

  1. Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]
  2. Context-faithful Prompting for Large Language Models. Zhou et al. 2023. [Paper]
  3. Benchmarking Large Language Models in Retrieval-Augmented Generation. Chen et al. 2023. [Paper]
  4. Automatic Evaluation of Attribution by Large Language Models. Yue et al. 2023. [Paper]

Knowledge Conflicts

  1. Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
  2. Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence. Chen et al. 2022. [Paper]
  3. Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Clashes. Xie et al. 2023. [Paper]
  4. Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]

Causes of Factual Errors

Model-level Causes

Forgetting

  1. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. Goodfellow et al. 2015. [Paper]
  2. Preserving In-Context Learning ability in Large Language Model Fine-tuning. Wang et al. 2022. [Paper]
  3. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. Chen et al. 2020. [Paper]
  4. Investigating the Catastrophic Forgetting in Multimodal Large Language Models. Zhai et al. 2023. [Paper]
  5. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. Luo et al. 2023. [Paper]

Reasoning Failure

  1. We're Afraid Language Models Aren't Modeling Ambiguity. Liu et al. 2023. [Paper]
  2. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". Berglund et al. 2023. [Paper]
  3. Understanding Catastrophic Forgetting in Language Models via Implicit Inference. Kotha et al. 2023. [Paper]
  4. Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family. Tan et al. 2023. [Paper]

Retrieval-level Causes

Misinformation Not Recognized by LLMs

  1. Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
  2. On the Risk of Misinformation Pollution with Large Language Models. Pan et al. 2023. [Paper]
  3. A Survey on Truth Discovery. Han et al. 2015. [Paper]

Distracting Information

  1. SAIL: Search-Augmented Instruction Learning. Luo et al. 2023. [Paper]
  2. Lost in the middle: How language models use long contexts. Liu et al. 2023. [Paper]

Misinterpretation of Related Information

  1. ReAct: Synergizing Reasoning and Acting in Language Models. Yao et al. 2023. [Paper]

Inference-level Causes

Snowballing

  1. How language model hallucinations can snowball. Zhang et al. 2023. [Paper]
  2. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]

Erroneous Decoding

  1. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang et al. 2023. [Paper]
  2. How Decoding Strategies Affect the Verifiability of Generated Text. Massarelli et al. 2020. [Paper]

Exposure Bias

  1. WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models. Felkner et al. 2023. [Paper]
  2. Bias and Fairness in Large Language Models: A Survey. Gallegos et al. 2023. [Paper]
  3. MISGENDERED: Limits of Large Language Models in Understanding Pronouns. Hossain et al. 2023. [Paper]

Evaluation of Factuality

Benchmarks

  1. Measuring Massive Multitask Language Understanding. Hendrycks et al. 2021. [Paper]
  2. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Lin et al. 2022. [Paper]
  3. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Li et al. 2023. [Paper]
  4. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. Huang et al. 2023. [Paper]
  5. Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
  6. Do Large Language Models Know about Facts?. Hu et al. 2023. [Paper]
  7. RealTime QA: What's the Answer Right Now?. Kasai et al. 2022. [Paper]
  8. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. Vu et al. 2023. [Paper]
  9. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. 2023. [Paper]
  10. Natural Questions: a Benchmark for Question Answering Research. Kwiatkowski et al. 2019. [Paper]
  11. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Joshi et al. 2017. [Paper]
  12. Semantic Parsing on Freebase from Question-Answer Pairs. Berant et al. 2013. [Paper]
  13. Open Question Answering over Tables and Text. Chen et al. 2021. [Paper]
  14. AmbigQA: Answering Ambiguous Open-domain Questions. Min et al. 2020. [Paper]
  15. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Yang et al. 2018. [Paper]
  16. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Ho et al. 2020. [Paper]
  17. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. Ferguson et al. 2020. [Paper]
  18. MuSiQue: Multihop Questions via Single-hop Question Composition. Trivedi et al. 2022. [Paper]
  19. ELI5: Long Form Question Answering. Fan et al. 2019. [Paper]
  20. FEVER: a large-scale dataset for Fact Extraction and VERification. Thorne et al. 2018. [Paper]
  21. Fool Me Twice: Entailment from Wikipedia Gamification. Eisenschlos et al. 2021. [Paper]
  22. HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification. Jiang et al. 2020. [Paper]
  23. The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task. Aly et al. 2021. [Paper]
  24. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. Elsahar et al. 2018. [Paper]
  25. Zero-Shot Relation Extraction via Reading Comprehension. Levy et al. 2017. [Paper]
  26. Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
  27. Neural Text Generation from Structured Data with Application to the Biography Domain. Lebret et al. 2016. [Paper]
  28. WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Hayashi et al. 2021. [Paper]
  29. KILT: a Benchmark for Knowledge Intensive Language Tasks. Petroni et al. 2021. [Paper]
  30. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Rae et al. 2022. [Paper]
  31. Curation Corpus Base. Curation et al. 2020. [Paper]
  32. Pointer sentinel mixture models. Merity et al. 2016. [Paper]
  33. The LAMBADA dataset: Word prediction requiring a broad discourse context. Paperno et al. 2016. [Paper]
  34. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al. 2020. [Paper]
  35. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Gao et al. 2020. [Paper]
  36. Wizard of Wikipedia: Knowledge-Powered Conversational agents. Dinan et al. 2019. [Paper]
  37. Grounded response generation task at dstc7. Galley et al. 2019. [Paper]
  38. "What do others think?": Task-Oriented Conversational Modeling with Subjective Knowledge. Zhao et al. 2023. [Paper]
  39. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Gehman et al. 2020. [Paper]
  40. Hey AI, Can You Solve Complex Tasks by Talking to Agents?. Khot et al. 2022. [Paper]
  41. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Geva et al. 2021. [Paper]
  42. TempQuestions: A Benchmark for Temporal Question Answering. Jia et al. 2018. [Paper]
  43. INFOTABS: Inference on Tables as Semi-structured Data. Gupta et al. 2020. [Paper]

Studies

  1. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Manakul et al. 2023. [Paper]
  2. Evaluating Open Question Answering Evaluation. Wang et al. 2023. [Paper]
  3. Measuring and Modifying Factual Knowledge in Large Language Models. Pezeshkpour et al. 2023. [Paper]
  4. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]
  5. FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. Chern et al. 2023. [Paper]
  6. Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
  7. Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
  8. Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
  9. Teaching language models to support answers with verified quotes. Menick et al. 2022. [Paper]

Evaluating Domain-specific Factuality

  1. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. Xie et al. 2023. [Paper]
  2. When FLUE Meets FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain. Shah et al. 2022. [Paper]
  3. EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li et al. 2023. [Paper]
  4. CMB: A Comprehensive Medical Benchmark in Chinese. Wang et al. 2023. [Paper]
  5. GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. Jin et al. 2023. [Paper]
  6. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Guha et al. 2023. [Paper]
  7. LawBench: Benchmarking Legal Knowledge of Large Language Models. Fei et al. 2023. [Paper]

Factuality Enhancement

On Standalone LLM Generation

Pretraining-based

Initial Pretraining
  1. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. Yejin Bang et al. arXiv 2023. [Paper]
  2. Deduplicating Training Data Makes Language Models Better. Lee, Katherine et al. ACL 2022. [Paper]
  3. Unsupervised Improvement of Factual Knowledge in Language Models. Sadeq, Nafis et al. EACL 2023. [Paper]
Continual Pretraining
  1. Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]

Supervised Finetuning

Continual SFT
  1. SKILL: Structured Knowledge Infusion for Large Language Models. Moiseev, Fedor et al. NAACL 2022. [Paper]
  2. Contrastive Learning Reduces Hallucination in Conversations. Sun, Weiwei et al. AAAI 2023. [Paper]
  3. ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling. Linyao Yang et al. arXiv 2023. [Paper]
Model Editing
  1. Editing Large Language Models: Problems, Methods, and Opportunities. Yunzhi Yao et al. arXiv 2023. [Paper]
  2. Knowledge Neurons in Pretrained Transformers. Dai, Damai et al. ACL 2022. [Paper]
  3. Locating and Editing Factual Associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
  4. Editing Factual Knowledge in Language Models. De Cao, Nicola et al. EMNLP 2021. [Paper]
  5. Fast Model Editing at Scale. Eric Mitchell et al. ICLR 2022. [Paper]
  6. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]

Multi-Agent

  1. Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
  2. LM vs LM: Detecting Factual Errors via Cross Examination. Roi Cohen et al. arXiv 2023. [Paper]

Novel Prompt

  1. Generate Rather than Retrieve: Large Language Models are Strong Context Generators. Yu, Wenhao et al. ICLR 2023. [Paper]
  2. "According to ..." Prompting Language Models Improves Quoting from Pre-Training Data. Orion Weller et al. arXiv 2023. [Paper]
  3. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. Tushar Khot et al. arXiv 2023. [Paper]
  4. Chain-of-Verification Reduces Hallucination in Large Language Models. Dhuliawala et al. arXiv 2023. [Paper]

Decoding

  1. Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]
  2. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang, Yung-Sung et al. arXiv 2023. [Paper]

On Retrieval-Augmented Generation

Normal RAG Setting

  1. Improving Language Models by Retrieving From Trillions of Tokens. Sebastian Borgeaud et al. arXiv 2021. [Paper]
  2. Internet-Augmented Language Models through Few-Shot Prompting for Open-Domain Question Answering. Angeliki Lazaridou et al. arXiv 2022. [Paper]

Interactive Retrieval

CoT-based Retrieval
  1. Rethinking with Retrieval: Faithful Large Language Model Inference. Hangfeng He et al. arXiv 2023. [Paper]
  2. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. Trivedi, Harsh et al. ACL 2023. [Paper]
  3. Active Retrieval Augmented Generation. Zhengbao Jiang et al. arXiv 2023. [Paper]
Agent-based Retrieval
  1. ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao et al. arXiv 2023. [Paper]
  2. Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn et al. arXiv 2023. [Paper]
  3. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Neeraj Varshney et al. arXiv 2023. [Paper]

Retrieval Adaptation

Prompt-based
  1. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. Baolin Peng et al. arXiv 2023. [Paper]
  2. Knowledge-Augmented Language Model Verification. Jinheon Baek et al. EMNLP 2023. [Paper]
SFT-based
  1. Atlas: Few-shot Learning with Retrieval Augmented Language Models. Gautier Izacard et al. arXiv 2022. [Paper]
  2. REPLUG: Retrieval-Augmented Black-Box Language Models. Weijia Shi et al. arXiv 2023. [Paper]
  3. SAIL: Search-Augmented Instruction Learning. Luo, Hongyin et al. arXiv 2023. [Paper]
RLHF-based
  1. Teaching Language Models to Support Answers with Verified Quotes. Jacob Menick et al. arXiv 2022. [Paper]

Retrieval on External Memory

  1. Decoupled Context Processing for Context Augmented Language Modeling. Zonglin Li et al. NeurIPS 2022. [Paper]
  2. G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks. Zhongwei Wan et al. EMNLP 2022. [Paper]
  3. Parameter-Efficient Transfer Learning for NLP. Neil Houlsby et al. ICML 2019. [Paper]
  4. KALA: Knowledge-Augmented Language Model Adaptation. Kang, Minki et al. NAACL 2022. [Paper]
  5. Entities as Experts: Sparse Memory Access with Entity Supervision. Thibault Févry et al. EMNLP 2020. [Paper]
  6. Mention Memory: Incorporating Textual Knowledge into Transformers through Entity Mention Attention. Michiel de Jong et al. ICLR 2022. [Paper]
  7. Plug-and-Play Knowledge Injection for Pre-trained Language Models. Zhang, Zhengyan et al. ACL 2023. [Paper]
  8. Evidence-based Factual Error Correction. Thorne, James et al. ACL 2021. [Paper]
  9. RARR: Researching and Revising What Language Models Say, Using Language Models. Gao, Luyu et al. ACL 2023. [Paper]
  10. PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions. Chen, Anthony et al. arXiv 2023. [Paper]

Retrieval on Structured Knowledge Source

  1. Mitigating Language Model Hallucination with Interactive Question-Knowledge Alignment. Shuo Zhang et al. arXiv 2023. [Paper]
  2. StructGPT: A general framework for Large Language Model to Reason on Structured Data. Jinhao Jiang et al. arXiv 2023. [Paper]
  3. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. Jinheon Baek et al. arXiv 2023. [Paper]

Domain Factuality Enhanced LLMs

Healthcare Domain-enhanced LLMs

  1. CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study. Guan, Zihan et al. arXiv 2023. [paper]
  2. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Li, Yunxiang et al. Cureus 2023. [paper]
  3. Deid-GPT: Zero-Shot Medical Text De-Identification By Gpt-4. Liu, Zhengliang et al. arXiv 2023. [paper]
  4. BioMedLM: A Domain-Specific Large Language Model for Biomedical Text. Venigalla, A et al. [blog] [model]
  5. MedChatZH: A Better Medical Adviser Learns from Better Instructions. Tan, Yang et al. arXiv 2023. [paper]
  6. BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining. Luo, Renqian et al. Briefings in Bioinformatics 2022. [paper]
  7. GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. Jin, Qiao et al. arXiv 2023. [paper]
  8. Almanac: Retrieval-Augmented Language Models for Clinical Medicine. Hiesinger, William et al. arXiv 2023. [paper]
  9. MolXPT: Wrapping Molecules with Text for Generative Pre-training. Liu, Zequn et al. arXiv 2023. [paper]
  10. HuatuoGPT, Towards Taming Language Model to Be a Doctor. Zhang, Hongbo et al. arXiv 2023. [paper]
  11. Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue. Yang, Songhua et al. arXiv 2023. [paper]
  12. Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. Wang, Yubo et al. arXiv 2023. [paper]
  13. DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation. Bao, Zhijie et al. arXiv 2023. [paper]

Legal Domain enhanced LLMs

  1. Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3. Nguyen, Ha-Thanh et al. arXiv 2023. [paper]
  2. Chatlaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. Cui, Jiaxi et al. arXiv 2023. [paper]
  3. Explaining Legal Concepts with Augmented Large Language Models (GPT-4). Savelka, Jaromir et al. arXiv 2023. [paper]
  4. Lawyer LLaMA Technical Report. Huang, Quzhe et al. arXiv 2023. [paper]

Finance Domain-enhanced LLMs

  1. EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li, Yangning et al. arXiv 2023. [paper]
  2. BloombergGPT: A Large Language Model for Finance. Shijie Wu et al. arXiv 2023. [paper]

Other Domain-Enhanced LLMs

Geoscience and Environment domain-enhanced LLMs
  1. Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. Deng, Cheng et al. arXiv 2023. [paper]
  2. HouYi: An Open-Source Large Language Model Specially Designed for Renewable Energy and Carbon Neutrality Field. Bai, Mingliang et al. arXiv 2023. [paper]
Education Domain-enhanced LLMs
  1. GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning. Fan, Yaxin et al. arXiv 2023. [paper]
Food Domain-enhanced LLMs
  1. FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt. Qi, Zhixiao et al. arXiv 2023. [paper]
Home Renovation Domain-enhanced LLMs
  1. ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation. Wen, Cheng et al. arXiv 2023. [paper]

Tables

Table: Comparison between the factuality issue and the hallucination issue.

Factual and Non-Hallucinated: Factually correct outputs.
Non-Factual and Hallucinated: Entirely fabricated outputs.
Hallucinated but Factual:
1. Outputs that are unfaithful to the prompt but remain factually correct (Cao et al, 2022).
2. Outputs that deviate from the prompt's specifics but don't touch on factuality, e.g., a prompt asking for a story about a rabbit and wolf becoming friends, but the LLM produces a tale about a rabbit and a dog befriending each other.
3. Outputs that provide additional factual details not specified in the prompt, e.g., a prompt asking about the capital of France, and the LLM responds with "Paris, which is known for the Eiffel Tower."
Non-Factual but Non-Hallucinated:
1. Outputs where the LLM states, "I don't know," or avoids a direct answer.
2. Outputs that are partially correct, e.g., for the question, "Who landed on the moon with Apollo 11?" If the LLM responds with just "Neil Armstrong," the answer is incomplete but not hallucinated.
3. Outputs that provide a generalized or vague response without specific details, e.g., for a question about the causes of World War II, the LLM might respond with "It was due to various political and economic factors."
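The two-axis comparison above can be sketched as a small lookup. This is only an illustration, assuming that some upstream judge (human or automatic) has already decided whether an output is factual and whether it is hallucinated; the names `OutputCategory` and `categorize` are hypothetical and do not come from the survey.

```python
from enum import Enum


class OutputCategory(Enum):
    """The four cells of the factuality vs. hallucination comparison."""
    FACTUAL_NON_HALLUCINATED = "Factual and Non-Hallucinated"
    NON_FACTUAL_HALLUCINATED = "Non-Factual and Hallucinated"
    HALLUCINATED_BUT_FACTUAL = "Hallucinated but Factual"
    NON_FACTUAL_NON_HALLUCINATED = "Non-Factual but Non-Hallucinated"


def categorize(is_factual: bool, is_hallucinated: bool) -> OutputCategory:
    """Map two independent judgments onto one of the four cells.

    Both flags are assumed inputs; this function does not judge text itself.
    """
    if is_factual and is_hallucinated:
        return OutputCategory.HALLUCINATED_BUT_FACTUAL
    if is_factual:
        return OutputCategory.FACTUAL_NON_HALLUCINATED
    if is_hallucinated:
        return OutputCategory.NON_FACTUAL_HALLUCINATED
    return OutputCategory.NON_FACTUAL_NON_HALLUCINATED


# "I don't know" is neither a correct fact nor a fabrication.
print(categorize(is_factual=False, is_hallucinated=False).value)
```

Keeping the two judgments as separate booleans mirrors the point of the table: factuality and hallucination are related but not interchangeable.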

Causes of Factual Errors

Category: Model-level causes

Cause: Domain knowledge deficit
Example dialog:
Q: CEO of Assicurazioni Generali?
BloombergGPT: Philippe Donnet
GPT-NeoX: Antonio De Lorenzo, Simone Gambarini, Enrico Zanetti
FLAN-T5-XXL: John M Forsyth, Christopher K Peters, {empty string}
Notes and references: BloombergGPT is a finance domain-specific language model. Wu et al, 2023

Cause: Outdated information
Example dialog:
Q: When was Kyiv attacked by Russia?
ChatGPT: As of my last knowledge update in September 2021, Russia had not launched an attack on Kyiv.
Notes and references: Kyiv was attacked by Russia on 25 February 2022.

Cause: Reasoning error
Example dialog:
Q: Who is Tom Cruise’s mother?
A: Mary Lee Pfeiffer
Q: Who is Mary Lee Pfeiffer’s son?
A: There is no widely known information about...
Notes and references: From Berglund et al, 2023. It is clear that the model knows Tom Cruise’s mother is Mary Lee Pfeiffer, but it fails to reason that Mary Lee Pfeiffer has a son named Tom Cruise.

Category: Retrieval-level causes

Cause: Distracted by the retrieved information
Example dialog:
Retrieved context: "Alan Melikdjanian - Age, Bio, Faces and Birthday"
Alan Melikdjanian is the Director, Writer (with his father), Editor, and Composer of the 2006 direct-to-DVD Russian-language film Citizen Mavzik...
Q: What is the nationality of the director of Citizen Mavzik?
A: (1) The director of the film Citizen Mavzik is Alan Melikdjanian. (2) Alan Melikdjanian is from Russia. So, the answer is Russia.
Notes and references: Alan Melikdjanian, known by the alias Captain Disillusion, is a Soviet-born American independent filmmaker and YouTuber. Melikdjanian et al, 2018

Cause: Misunderstood the retrieved information
Example dialog:
Retrieved context: "Gerard III, Count of Guelders - Wikipedia"
Gerard married Margaretha of Brabant...
Q: Who is Gerard I, Count Of Guelders's wife?
A: ... the name of Gerard I, Count of Guelders's wife, which is Margaretha of Brabant.
Notes and references: The retrieved context accurately provided the name of Gerard III, Count of Guelders' wife, which is Margaretha of Brabant. However, the model erroneously referred to him as Gerard I, resulting in an incorrect answer.

Cause: Fail to address misinformation in retrieved documents
Example dialog:
Retrieved context: [0] "James Vincent - Death Notice and Service Information - Legacy.com"
James Vincent died on August 9, 2023...
[1] "UPDATE: Jury convicts James Vincent of murder after deliberating..." A jury convicted James Vincent of murder in the 2018 Shooting death of Luka Grabb...
Q: When did James Vincent die?
A: James Vincent died on August 9, 2023, in Long Beach, California.
Notes and references: In the presence of contradictory information from retrieved documents, the model should report what each document says and flag the conflict instead of committing to a single source. For instance, the model could state: "According to the first document, James Vincent passed away on August 9, 2023, in Long Beach. However, the second document suggests that James Vincent was murdered in 2018."

Category: Inference-level causes

Cause: Snowballing
Example dialog:
Q: Was there ever a US senator who represented the state of New Hampshire and whose alma mater was the University of Pennsylvania?
A: Yes... His name was John P. Hale
Notes and references: John P. Hale graduated from Bowdoin College. Zhang et al, 2023

Cause: Exposure bias
Example dialog:
Q: Aamari was very stoic. [PRONOUN] rarely showed any emotion.
A: He.
Notes and references: The correct answer was Xe according to Hossain et al, 2023.

Evaluations

| Reference | Task | Dataset | Metrics | Human Eval | Evaluated LLMs | Granularity |
| --- | --- | --- | --- | --- | --- | --- |
| FActScore (Min et al, 2023) | Biography Generation | 183 people entities | F1 | | GPT-3.5, ChatGPT... | T |
| SelfCheckGPT (Manakul et al, 2023) | Bio Generation | WikiBio | AUC-PR, Human Score | | GPT-3, LLaMA, OPT, GPT-J... | S |
| Wang et al, 2023 | Open QA | NQ, TQ | ACC, EM | | GPT-3.5, ChatGPT, GPT-4, Bing Chat | S |
| Pezeshkpour et al, 2023 | Knowledge Probing | T-REx, LAMA | ACC | | GPT3.5 | T |
| De Cao et al, 2021 | QA, Fact Checking | KILT, FEVER, zsRE | ACC | | GPT-3, FLAN-T5 | S/T |
| Varshney et al, 2023 | Article Generation | Unnamed Dataset | ACC, AUC | | GPT3.5, Vicuna | S |
| FacTool (Chern et al, 2023) | KB-based QA | RoSE | ACC, F1... | | GPT-4, ChatGPT, FLAN-T5 | S |
| Kadavath et al, 2022 | Self-evaluation | BIG Bench, MMLU, LogiQA, TruthfulQA, QuALITY, TriviaQA, Lambada | ACC, Brier Score, RMS Calibration Error... | | Claude | T |
| Retro (Borgeaud et al, 2022) | QA, Language Modeling | MassiveText, Curation Corpus, Wikitext103, Lambada, C4, Pile, NQ | PPL, ACC, Exact Match | | Retro | T |
| GenRead (Yu et al, 2023) | QA, Dialogue, Fact Checking | NQ, TQ, WebQ, FEVER, FM2, WoW | EM, ACC, F1, Rouge-L | - | GPT3.5, Codex, GPT-3, Gopher, FLAN, GLaM, PaLM | S |
| GopherCite (Menick et al, 2022) | Self-supported QA | NQ, ELI5, TruthfulQA (Health, Law, Fiction, Conspiracies) | Human Score | | GopherCite | S |
| Trivedi et al, 2023 | QA | HotpotQA, IIRC, 2WikiMultihopQA, MuSiQue (music) | Retrieval recall, Answer F1 | - | GPT-3, FLAN-T5 | S/T |
| Peng et al, 2023 | QA, Dialogue | DSTC7 track2, DSTC11 track5, OTT-QA | ROUGE, chrF, BERTScore, Usefulness, Humanness... | | ChatGPT | S/T |
| CRITIC (Gou et al, 2023) | QA, Toxicity Reduction | AmbigNQ, TriviaQA, HotpotQA, RealToxicityPrompts | Exact Match, maximum toxicity, perplexity, n-gram diversity, AUROC... | - | GPT-3.5, ChatGPT | T |
| Khot et al, 2023 | QA, long-context QA | CommaQA-E, 2WikiMultihopQA, MuSiQue, HotpotQA | Exact Match, Answer F1 | - | GPT-3, FLAN-T5 | T |
| ReAct (Yao et al, 2023) | QA, Fact Verification | HotpotQA, FEVER | Exact Match, ACC | - | PaLM, GPT-3 | S/T |
| Jiang et al, 2023 | QA, Commonsense Reasoning, long-form QA... | 2WikiMultihopQA, StrategyQA, ASQA, WikiAsp | Exact Match, Disambig-F1, ROUGE, entity F1... | - | GPT-3.5 | T |
| Lee et al, 2022 | Open-ended Generation | FEVER | Entity score, EntailmentRatio, ppl... | - | Megatron-LM | T |
| SAIL (Luo et al, 2023) | QA, Fact Checking | UniLC | ACC, F1 | - | LLaMA, Vicuna, SAIL | T |
| He et al, 2022 | Commonsense Reasoning, Temporal Reasoning, Tabular Reasoning | StrategyQA, TempQuestions, INFOTABS | ACC | - | GPT-3 | T |
| Pan et al, 2023 | Fact Checking | HOVER, FEVEROUS-S | Macro-F1 | - | Codex, FLAN-T5 | S |
| Multiagent Debate (Du et al, 2023) | Biography, MMLU | Unnamed Biography Dataset, MMLU | ChatGPT Evaluator, ACC | - | Bard, ChatGPT | S |
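Exact Match (EM) and answer F1 recur across the Metrics column above. The sketch below shows one common way to compute them for short-answer QA, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The individual papers in the table may normalize or aggregate differently, so treat the helper names `normalize`, `exact_match`, and `token_f1` as illustrative rather than as any paper's official scorer.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris.", "paris"))
print(token_f1("Neil Armstrong", "Neil Armstrong and Buzz Aldrin"))
```

The second call illustrates why F1 is often reported next to EM: a partially correct answer scores zero under EM but receives partial credit under token F1.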

Benchmarks

| Reference | Task Type | Dataset | Metrics | Performance of Representative LLMs |
| --- | --- | --- | --- | --- |
| MMLU (Hendrycks et al, 2021) | Multi-Choice QA | Humanities, Social Sciences, STEM... | ACC | (ACC, 5-shot) GPT-4: 86.4; GPT-3.5: 70; LLaMA2-70B: 68.9 |
| TruthfulQA (Lin et al, 2022) | QA | Health, Law, Conspiracies, Fiction... | Human Score, GPT-judge, ROUGE, BLEU, MC1, MC2... | (zero-shot) GPT-4: ~29 (MC1); GPT-3.5: ~28 (MC1), 79.92 (%true); LLaMA2-70B: 53.37 (%true) |
| C-Eval (Huang et al, 2023) | Multi-Choice QA | STEM, Social Science, Humanities... | ACC | (zero-shot, average ACC) GPT-4: 68.7; GPT-3.5: 54.4; LLaMA2-70B: 50.13 |
| AGIEval (Zhong et al, 2023) | Multi-Choice QA | Gaokao (geometry, Bio, history...), SAT, Law... | ACC | (zero-shot, average ACC) GPT-4: 56.4; GPT-3.5: 42.9; LLaMA2-70B: 40.02 |
| HaluEval (Li et al, 2023) | Hallucination Evaluation | HaluEval | ACC | (general ACC) GPT-3.5: 86.22 |
| BigBench (Srivastava et al, 2023) | Multi-tasks (QA, NLI, Fact Checking, Reasoning...) | BigBench | Metric to each type of task | (Big-Bench Hard) GPT-3.5: 49.6; LLaMA-65B: 42.6 |
| ALCE (Gao et al, 2023) | Citation Generation | ASQA, ELI5, QAMPARI | MAUVE, Exact Match, ROUGE-L... | (ASQA, 3-psg, citation prec) GPT-3.5: 73.9; LLaMA-33B: 23.0 |
| QUIP (Weller et al, 2023) | Generative QA | TriviaQA, NQ, ELI5, HotpotQA | QUIP-Score, Exact match | (ELI5, QUIP, null prompt) GPT-4: 21.0; GPT-3.5: 27.7 |
| PopQA (Mallen et al, 2023) | Multi-Choice QA | PopQA, EntityQuestions | ACC | (overall ACC) GPT-3.5: ~37.0 |
| UniLC (Zhang et al, 2023) | Fact Checking | Climate, Health, MGFN | ACC, F1 | (zero-shot, fact tasks, average F1) GPT-3.5: 51.62 |
| Pinocchio (Hu et al, 2023) | Fact Checking, QA, Reasoning | Pinocchio | ACC, F1 | GPT-3.5: (Zero-shot ACC: 46.8, F1: 44.4); GPT-3.5: (Few-shot ACC: 47.1, F1: 45.7) |
| SelfAware (Yin et al, 2023) | Self-evaluation | SelfAware | ACC | (instruction input, F1) GPT-4: 75.47; GPT-3.5: 51.43; LLaMA-65B: 46.89 |
| RealTimeQA (Kasai et al, 2022) | Multi-Choice QA, Generative QA | RealTimeQA | ACC, F1 | (original setting, GCS retrieval) GPT-3: 69.3 (ACC for MC); GPT-3: 39.4 (F1 for generation) |
| FreshQA (Vu et al, 2023) | Generative QA | FRESHQA | ACC (Human) | (strict ACC, null prompt) GPT-4: 28.6; GPT-3.5: 26.0 |

Domain evaluation

| Reference | Domain | Task | Datasets | Metrics | Evaluated LLMs |
| --- | --- | --- | --- | --- | --- |
| Xie et al, 2023 | Finance | Sentiment analysis, News headline classification, Named entity recognition, Question answering, Stock movement prediction | FLARE | F1, Acc, Avg F1, Entity F1, EM, MCC | GPT-4, BloombergGPT, FinMA-(7B, 30B, 7B-full), Vicuna-7B |
| Li et al, 2023 | Finance | 134 E-com tasks | EcomInstruct | Micro-F1, Macro-F1, ROUGE | BLOOM, BLOOMZ, ChatGPT, EcomGPT |
| Wang et al, 2023 | Medicine | Multi-Choice QA | CMB | Acc | GPT-4, ChatGLM2-6B, ChatGPT, DoctorGLM, Baichuan-13B-chat, HuatuoGPT, MedicalGPT, ChatMed-Consult, ChatGLM-Med, Bentsao, BianQue-2 |
| Li et al, 2023 | Medicine | Generative-QA | Huatuo-26M | BLEU, ROUGE, GLEU | T5, GPT2 |
| Jin et al, 2023 | Medicine | Nomenclature, Genomic location, Functional analysis, Sequence alignment | GeneTuring | Acc | GPT-2, BioGPT, BioMedLM, GPT-3, ChatGPT, New Bing |
| Guha et al, 2023 | Law | Issue-spotting, Rule-recall, Rule-application, Rule-conclusion, Interpretation, Rhetorical-understanding | LegalBench | Acc, EM | GPT-4, GPT-3.5, Claude-1, Incite, OPT, Falcon, LLaMA-2, FLAN-T5... |
| Fei et al, 2023 | Law | Legal QA, NER, Sentiment Analysis, Reading Comprehension | LawBench | F1, Acc, ROUGE-L, Normalized log-distance... | GPT-4, ChatGPT, InternLM-Chat, StableBeluga2... |

Enhancement

Enhancement methods

| Reference | Dataset | Metrics | Baselines ➝ Theirs | Dataset | Metrics | Baselines ➝ Theirs |
| --- | --- | --- | --- | --- | --- | --- |
| Li et al, 2022 | NQ | EM | 34.5 ➝ 44.35 (T5 11B) | GSM8K | ACC | 77.0 ➝ 85.0 (ChatGPT) |
| Yu et al, 2023 | NQ | EM | 20.9 ➝ 28.0 (InstructGPT) | TriviaQA | EM | 57.5 ➝ 59.0 (InstructGPT) |
| - | - | - | - | WebQA | EM | 18.6 ➝ 24.6 (InstructGPT) |
| Chuang et al, 2023 | FACTOR News | ACC | 58.3 ➝ 62.0 (LLaMA-7B) | FACTOR News | ACC | 61.1 ➝ 62.5 (LLaMA-13B) |
| - | FACTOR News | ACC | 63.8 ➝ 65.4 (LLaMA-33B) | FACTOR News | ACC | 63.6 ➝ 66.2 (LLaMA-65B) |
| - | FACTOR Wiki | ACC | 58.6 ➝ 62.2 (LLaMA-7B) | FACTOR Wiki | ACC | 62.6 ➝ 66.2 (LLaMA-13B) |
| - | FACTOR Wiki | ACC | 69.5 ➝ 70.3 (LLaMA-33B) | FACTOR Wiki | ACC | 72.2 ➝ 72.4 (LLaMA-65B) |
| - | TruthfulQA | %Truth * Info | 32.4 ➝ 44.6 (LLaMA-13B) | TruthfulQA | %Truth * Info | 34.8 ➝ 49.2 (LLaMA-65B) |
| Li et al, 2022 | TruthfulQA | %Truth * Info | 32.4 ➝ 44.4 (LLaMA-13B) | TruthfulQA | %Truth * Info | 31.7 ➝ 36.7 (LLaMA-33B) |
| - | TruthfulQA | %Truth * Info | 34.8 ➝ 43.4 (LLaMA-65B) | - | - | - |
| Li et al, 2023 | NQ | ACC | 46.6 ➝ 51.3 (LLaMA-7B) | TriviaQA | ACC | 89.6 ➝ 91.1 (LLaMA-7B) |
| - | MMLU | ACC | 35.7 ➝ 40.1 (LLaMA-7B) | TruthfulQA | %Truth * Info | 32.5 ➝ 65.1 (Alpaca) |
| - | TruthfulQA | %Truth * Info | 26.9 ➝ 43.5 (LLaMA-7B) | TruthfulQA | %Truth * Info | 51.5 ➝ 74.0 (Vicuna) |
| Cohen et al, 2023 | LAMA | F1 | 50.7 ➝ 80.8 (ChatGPT) | TriviaQA | F1 | 56.2 ➝ 82.6 (ChatGPT) |
| - | NQ | F1 | 60.6 ➝ 79.1 (ChatGPT) | PopQA | F1 | 65.2 ➝ 85.4 (ChatGPT) |
| - | LAMA | F1 | 42.5 ➝ 79.3 (GPT-3) | TriviaQA | F1 | 46.7 ➝ 77.2 (GPT-3) |
| - | NQ | F1 | 52.0 ➝ 78.0 (GPT-3) | PopQA | F1 | 43.7 ➝ 77.4 (GPT-3) |
...
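
When comparing entries in the enhancement table, it can help to turn a "Baselines ➝ Theirs" cell into an absolute and a relative gain. The snippet below is a minimal sketch and not part of the survey: the name `parse_gain`, the `Gain` record, and the assumed cell format (`<baseline> ➝ <theirs> (<model>)`) are illustrative assumptions based on how the cells are written here.

```python
import re
from typing import NamedTuple


class Gain(NamedTuple):
    model: str
    baseline: float
    theirs: float
    absolute: float  # theirs - baseline, in metric points
    relative: float  # absolute / baseline, as a fraction


# Assumed cell format, e.g. "34.5 ➝ 44.35 (T5 11B)" (hypothetical pattern).
_CELL = re.compile(r"^\s*([\d.]+)\s*➝\s*([\d.]+)\s*\((.+)\)\s*$")


def parse_gain(cell: str) -> Gain:
    """Parse one 'baseline ➝ theirs (model)' cell and compute its gains."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    baseline = float(match.group(1))
    theirs = float(match.group(2))
    absolute = theirs - baseline
    return Gain(match.group(3), baseline, theirs, absolute, absolute / baseline)


if __name__ == "__main__":
    for cell in ["34.5 ➝ 44.35 (T5 11B)", "50.7 ➝ 80.8 (ChatGPT)"]:
        g = parse_gain(cell)
        print(f"{g.model}: +{g.absolute:.2f} points ({g.relative:.1%} relative)")
```

Gains computed this way are only meaningful within one dataset, metric, and base model; rows that use different metrics (EM, ACC, F1, %Truth * Info) should not be ranked against each other.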

Domain-enhanced LLMs

Reference Domain Model Eval Task Eval Dataset Continual Pretrained? Continual SFT? Train From Scratch? External Knowledge
Zhang et al, 2023 Healthcare Baichuan-7B, Ziya-LLaMA-13B QA cMedQA2, WebMedQA, Huatuo-26M ✔️
Yang et al, 2023 Healthcare Ziya-LLaMA-13B QA CMtMedQA, Huatuo-26M ✔️ ✔️
Wang et al, 2023 Healthcare GPT-3.5-Turbo, LLaMA-2-13B QA MedQA-USMLE, MedQA-MCMLE, MedMCQA ✔️
Ross et al, 2022 Healthcare MOLFORMER Molecule properties prediction ✔️
Bao et al, 2023 Healthcare Baichuan-13B QA CMB-Clin, CMD, CMID ✔️
Guan et al, 2023 Healthcare ChatGPT IU-RR, MIMIC-CXR ✔️
Liu et al, 2023 Healthcare GPT-4 Medical Text De-Identification ✔️
Li et al, 2023 Healthcare LLaMA QA ✔️
Venigalla et al, 2022 Healthcare GPT (2.7b) QA ✔️
Xiong et al, 2023 Healthcare ChatGLM-6B QA ✔️
Tan et al, 2023 Healthcare Baichuan-7B QA C-Eval, MMLU ✔️
Luo et al, 2022 Healthcare GPT-2 QA, DC, RE ✔️
Jin et al, 2023 Healthcare Codex QA GeneTuring ✔️
Zakka et al, 2023 Healthcare text-davinci-003 QA ClinicalQA ✔️
Liu et al, 2023 Healthcare GPT-2 medium Molecular Property Prediction, Molecule-text translation ✔️ ✔️
Nguyen et al, 2023 Law GPT3 ✔️
Savelka et al, 2023 Law GPT-4 ✔️
Huang et al, 2023 Law LLaMA CN Legal Tasks ✔️ ✔️
Cui et al, 2023 Law Ziya-LLaMA-13B QA national judicial examination question ✔️ ✔️
Li et al, 2023 Finance BLOOMZ 4 major tasks 12 subtasks EcomInstruct ✔️
Wu et al, 2023 Finance BLOOM Financial NLP (SA, BC, NER, NER+NED, QA) Financial Datasets ✔️
Deng et al, 2023 Geoscience LLaMA-7B GeoBench ✔️
Bai et al, 2023 Geoscience ChatGLM-6B ✔️
Fan et al, 2023 Education phoenix-inst-chat-7b Chinese Grammatical Error Correction ChatGPT-generated, Human-annotated ✔️
Qi et al, 2023 Food Chinese-LLaMA2-13B QA ✔️ ✔️
Wen et al, 2023 Home Renovation Baichuan-13B C-Eval, CMMLU, EvalHome ✔️

Reference

If you find this project useful in your research or work, please consider citing it:

@misc{wang2023survey,
      title={Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity}, 
      author={Cunxiang Wang and Xiaoze Liu and Yuanhao Yue and Xiangru Tang and Tianhang Zhang and Cheng Jiayang and Yunzhi Yao and Wenyang Gao and Xuming Hu and Zehan Qi and Yidong Wang and Linyi Yang and Jindong Wang and Xing Xie and Zheng Zhang and Yue Zhang},
      year={2023},
      eprint={2310.07521},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgements

  1. CHEN Liang (ChanLiang) for PR#1.
  2. JinheonBaek (JinheonBaek) for PR#2 and PR#3.
