The repository for the survey paper "Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity"
Cunxiang Wang1,7*, Xiaoze Liu2*, Yuanhao Yue3*, Qipeng Guo4, Xiangkun Hu4, Xiangru Tang5, Tianhang Zhang6, Cheng Jiayang7, Yunzhi Yao8, Wenyang Gao1,8, Xuming Hu9, Zehan Qi9, Yidong Wang1, Linyi Yang1, Jindong Wang10, Xing Xie10, Zheng Zhang4,11 and Yue Zhang1.
1. School of Engineering, Westlake University; 2. Purdue University; 3. Fudan University; 4. Amazon AWS AI Lab; 5. Yale University; 6. Shanghai Jiao Tong University; 7. HKUST; 8. Zhejiang University; 9. Tsinghua University; 10. Microsoft Research; 11. NYU Shanghai;
(*: Equal Contribution; Correspondence to: Yue Zhang)
NOTE: Because the arXiv paper cannot be updated in real time, please consult this repository for the most recent developments and modifications. We greatly appreciate and welcome pull requests or issues to enhance the quality of this survey. All contributions will be listed in the acknowledgements section.
- Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
- Locating and Editing Factual Associations in GPT. Meng et al. 2022. [Paper]
- Transformer Feed-Forward Layers Are Key-Value Memories. Geva et al. 2021. [Paper]
- Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Geva et al. 2022. [Paper]
- Dissecting Recall of Factual Associations in Auto-Regressive Language Models. Geva et al. 2023. [Paper]
- Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons. Chen et al. 2023. [Paper]
- A rigorous study of integrated gradients method and extensions to internal neuron attributions. Lundstrom et al. 2022. [Paper]
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. Gou et al. 2023. [Paper]
- Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. Ren et al. 2023. [Paper]
- Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
- A Survey on In-context Learning. Dong et al. 2023. [Paper]
- Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
- The Internal State of an LLM Knows When It's Lying. Azaria et al. 2023. [Paper]
- Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
- Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Izacard et al. 2021. [Paper]
- Large language models struggle to learn long-tail knowledge. Kandpal et al. 2023. [Paper]
- Head-to-Tail: How Knowledgeable are Large Language Models (LLM)? AKA Will LLMs Replace Knowledge Graphs?. Sun et al. 2023. [Paper]
- Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]
- Context-faithful Prompting for Large Language Models. Zhou et al. 2023. [Paper]
- Benchmarking Large Language Models in Retrieval-Augmented Generation. Chen et al. 2023. [Paper]
- Automatic Evaluation of Attribution by Large Language Models. Yue et al. 2023. [Paper]
- Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
- Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence. Chen et al. 2022. [Paper]
- Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Clashes. Xie et al. 2023. [Paper]
- Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]
- An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. Goodfellow et al. 2015. [Paper]
- Preserving In-Context Learning ability in Large Language Model Fine-tuning. Wang et al. 2022. [Paper]
- Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. Chen et al. 2020. [Paper]
- Investigating the Catastrophic Forgetting in Multimodal Large Language Models. Zhai et al. 2023. [Paper]
- An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. Luo et al. 2023. [Paper]
- We're Afraid Language Models Aren't Modeling Ambiguity. Liu et al. 2023. [Paper]
- The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". Berglund et al. 2023. [Paper]
- Understanding Catastrophic Forgetting in Language Models via Implicit Inference. Kotha et al. 2023. [Paper]
- Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family. Tan et al. 2023. [Paper]
- Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
- On the Risk of Misinformation Pollution with Large Language Models. Pan et al. 2023. [Paper]
- A Survey on Truth Discovery. Han et al. 2015. [Paper]
- SAIL: Search-Augmented Instruction Learning. Luo et al. 2023. [Paper]
- Lost in the middle: How language models use long contexts. Liu et al. 2023. [Paper]
- ReAct: Synergizing Reasoning and Acting in Language Models. Yao et al. 2023. [Paper]
- How language model hallucinations can snowball. Zhang et al. 2023. [Paper]
- A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang et al. 2023. [Paper]
- How Decoding Strategies Affect the Verifiability of Generated Text. Massarelli et al. 2020. [Paper]
- WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models. Felkner et al. 2023. [Paper]
- Bias and Fairness in Large Language Models: A Survey. Gallegos et al. 2023. [Paper]
- MISGENDERED: Limits of Large Language Models in Understanding Pronouns. Hossain et al. 2023. [Paper]
- Measuring Massive Multitask Language Understanding. Hendrycks et al. 2021. [Paper]
- TruthfulQA: Measuring How Models Mimic Human Falsehoods. Lin et al. 2022. [Paper]
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Li et al. 2023. [Paper]
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. Huang et al. 2023. [Paper]
- Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
- Do Large Language Models Know about Facts?. Hu et al. 2023. [Paper]
- RealTime QA: What's the Answer Right Now?. Kasai et al. 2022. [Paper]
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. Vu et al. 2023. [Paper]
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. 2023. [Paper]
- Natural Questions: a Benchmark for Question Answering Research. Kwiatkowski et al. 2019. [Paper]
- TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Joshi et al. 2017. [Paper]
- Semantic Parsing on Freebase from Question-Answer Pairs. Berant et al. 2013. [Paper]
- Open Question Answering over Tables and Text. Chen et al. 2021. [Paper]
- AmbigQA: Answering Ambiguous Open-domain Questions. Min et al. 2020. [Paper]
- HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Yang et al. 2018. [Paper]
- Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Ho et al. 2020. [Paper]
- IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. Ferguson et al. 2020. [Paper]
- MuSiQue: Multihop Questions via Single-hop Question Composition. Trivedi et al. 2022. [Paper]
- ELI5: Long Form Question Answering. Fan et al. 2019. [Paper]
- FEVER: a large-scale dataset for Fact Extraction and VERification. Thorne et al. 2018. [Paper]
- Fool Me Twice: Entailment from Wikipedia Gamification. Eisenschlos et al. 2021. [Paper]
- HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification. Jiang et al. 2020. [Paper]
- The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task. Aly et al. 2021. [Paper]
- T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. Elsahar et al. 2018. [Paper]
- Zero-Shot Relation Extraction via Reading Comprehension. Levy et al. 2017. [Paper]
- Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
- Neural Text Generation from Structured Data with Application to the Biography Domain. Lebret et al. 2016. [Paper]
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Hayashi et al. 2021. [Paper]
- KILT: a Benchmark for Knowledge Intensive Language Tasks. Petroni et al. 2021. [Paper]
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Rae et al. 2022. [Paper]
- Curation Corpus Base. Curation et al. 2020. [Paper]
- Pointer sentinel mixture models. Merity et al. 2016. [Paper]
- The LAMBADA dataset: Word prediction requiring a broad discourse context. Paperno et al. 2016. [Paper]
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al. 2020. [Paper]
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Gao et al. 2020. [Paper]
- Wizard of Wikipedia: Knowledge-Powered Conversational agents. Dinan et al. 2019. [Paper]
- Grounded response generation task at dstc7. Galley et al. 2019. [Paper]
- "What do others think?": Task-Oriented Conversational Modeling with Subjective Knowledge. Zhao et al. 2023. [Paper]
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Gehman et al. 2020. [Paper]
- Hey AI, Can You Solve Complex Tasks by Talking to Agents?. Khot et al. 2022. [Paper]
- Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Geva et al. 2021. [Paper]
- TempQuestions: A Benchmark for Temporal Question Answering. Jia et al. 2018. [Paper]
- INFOTABS: Inference on Tables as Semi-structured Data. Gupta et al. 2020. [Paper]
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Manakul et al. 2023. [Paper]
- Evaluating Open Question Answering Evaluation. Wang et al. 2023. [Paper]
- Measuring and Modifying Factual Knowledge in Large Language Models. Pezeshkpour et al. 2023. [Paper]
- A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. Chern et al. 2023. [Paper]
- Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
- Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
- Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
- Teaching language models to support answers with verified quotes. Menick et al. 2022. [Paper]
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. Xie et al. 2023. [Paper]
- When flue meets flang: Benchmarks and large pre-trained language model for financial domain. Shah et al. 2022. [Paper]
- EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li et al. 2023. [Paper]
- CMB: A Comprehensive Medical Benchmark in Chinese. Wang et al. 2023. [Paper]
- Genegpt: Augmenting large language models with domain tools for improved access to biomedical information. Jin et al. 2023. [Paper]
- Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. Guha et al. 2023. [Paper]
- LawBench: Benchmarking Legal Knowledge of Large Language Models. Fei et al. 2023. [Paper]
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. Yejin Bang et al. arXiv 2023. [Paper]
- Deduplicating Training Data Makes Language Models Better. Lee, Katherine et al. ACL 2022. [Paper]
- Unsupervised Improvement of Factual Knowledge in Language Models. Sadeq, Nafis et al. EACL 2023. [Paper]
- Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]
- SKILL: Structured Knowledge Infusion for Large Language Models. Moiseev, Fedor et al. NAACL 2022. [Paper]
- Contrastive Learning Reduces Hallucination in Conversations. Sun, Weiwei et al. AAAI 2023. [Paper]
- ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling. Linyao Yang et al. arXiv 2023. [Paper]
- Editing Large Language Models: Problems, Methods, and Opportunities. Yunzhi Yao et al. arXiv 2023. [Paper]
- Knowledge Neurons in Pretrained Transformers. Dai, Damai et al. ACL 2022. [Paper]
- Locating and Editing Factual Associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
- Editing Factual Knowledge in Language Models. De Cao, Nicola et al. EMNLP 2021. [Paper]
- Fast Model Editing at Scale. Eric Mitchell et al. ICLR 2022. [Paper]
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]
- Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
- LM vs LM: Detecting Factual Errors via Cross Examination. Roi Cohen et al. arXiv 2023. [Paper]
- Generate Rather than Retrieve: Large Language Models are Strong Context Generators. Yu, Wenhao et al. ICLR 2023. [Paper]
- "According to ..." Prompting Language Models Improves Quoting from Pre-Training Data. Orion Weller et al. arXiv 2023. [Paper]
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks. Tushar Khot et al. arXiv 2023. [Paper]
- Chain-of-Verification Reduces Hallucination in Large Language Models. Dhuliawala et al. arXiv 2023. [Paper]
- Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang, Yung-Sung et al. arXiv 2023. [Paper]
- Improving Language Models by Retrieving From Trillions of Tokens. Sebastian Borgeaud et al. arXiv 2021. [Paper]
- Internet-Augmented Language Models through Few-Shot Prompting for Open-Domain Question Answering. Angeliki Lazaridou et al. arXiv 2022. [Paper]
- Rethinking with Retrieval: Faithful Large Language Model Inference. Hangfeng He et al. arXiv 2023. [Paper]
- Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. Trivedi, Harsh et al. ACL 2023. [Paper]
- Active Retrieval Augmented Generation. Zhengbao Jiang et al. arXiv 2023. [Paper]
- ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao et al. arXiv 2023. [Paper]
- Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn et al. arXiv 2023. [Paper]
- A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Neeraj Varshney et al. arXiv 2023. [Paper]
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. Baolin Peng et al. arXiv 2023. [Paper]
- Knowledge-Augmented Language Model Verification. Jinheon Baek et al. EMNLP 2023. [Paper]
- Atlas: Few-shot Learning with Retrieval Augmented Language Models. Gautier Izacard et al. arXiv 2022. [Paper]
- REPLUG: Retrieval-Augmented Black-Box Language Models. Weijia Shi et al. arXiv 2023. [Paper]
- SAIL: Search-Augmented Instruction Learning. Luo, Hongyin et al. arXiv 2023. [Paper]
- Teaching Language Models to Support Answers with Verified Quotes. Jacob Menick et al. arXiv 2022. [Paper]
- Decoupled Context Processing for Context Augmented Language Modeling. Zonglin Li et al. NeurIPS 2022. [Paper]
- G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks. Zhongwei Wan et al. EMNLP 2022. [Paper]
- Parameter-Efficient Transfer Learning for NLP. Neil Houlsby et al. ICML 2019. [Paper]
- KALA: Knowledge-Augmented Language Model Adaptation. Kang, Minki et al. NAACL 2022. [Paper]
- Entities as Experts: Sparse Memory Access with Entity Supervision. Thibault Févry et al. EMNLP 2020. [Paper]
- Mention Memory: Incorporating Textual Knowledge into Transformers through Entity Mention Attention. Michiel de Jong et al. ICLR 2022. [Paper]
- Plug-and-Play Knowledge Injection for Pre-trained Language Models. Zhang, Zhengyan et al. ACL 2023. [Paper]
- Evidence-based Factual Error Correction. Thorne, James et al. ACL 2021. [Paper]
- Rarr: Researching and revising what language models say, using language models. Gao, Luyu et al. ACL 2023. [Paper]
- PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions. Chen, Anthony et al. arXiv 2023. [Paper]
- Mitigating Language Model Hallucination with Interactive Question-Knowledge Alignment. Shuo Zhang et al. arXiv 2023. [Paper]
- StructGPT: A general framework for Large Language Model to Reason on Structured Data. Jinhao Jiang et al. arXiv 2023. [Paper]
- Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. Jinheon Baek et al. arXiv 2023. [Paper]
- CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study. Guan, Zihan et al. arXiv 2023. [paper]
- ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Li, Yunxiang et al. Cureus 2023. [paper]
- Deid-GPT: Zero-Shot Medical Text De-Identification By Gpt-4. Liu, Zhengliang et al. arXiv 2023. [paper]
- Biomedlm: A Domain-Specific Large Language Model for Biomedical Text. Venigalla, A et al. [blog] [model]
- MedChatZH: A Better Medical Adviser Learns from Better Instructions. Tan, Yang et al. arXiv 2023. [paper]
- BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining. Luo, Renqian et al. Briefings in Bioinformatics 2022. [paper]
- Genegpt: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. Jin, Qiao et al. arXiv 2023. [paper]
- Almanac: Retrieval-Augmented Language Models for Clinical Medicine. Hiesinger, William et al. arXiv 2023. [paper]
- MolXPT: Wrapping Molecules with Text for Generative Pre-training. Liu, Zequn et al. arXiv 2023. [paper]
- HuatuoGPT, Towards Taming Language Model to Be a Doctor. Zhang, Hongbo et al. arXiv 2023. [paper]
- Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue. Yang, Songhua et al. arXiv 2023. [paper]
- Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. Wang, Yubo et al. arXiv 2023. [paper]
- DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation. Bao, Zhijie et al. arXiv 2023. [paper]
- Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3. Nguyen, Ha-Thanh et al. arXiv 2023. [paper]
- Chatlaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. Cui, Jiaxi et al. arXiv 2023. [paper]
- Explaining Legal Concepts with Augmented Large Language Models (GPT-4). Savelka, Jaromir et al. arXiv 2023. [paper]
- Lawyer LLaMA Technical Report. Huang, Quzhe et al. arXiv 2023. [paper]
- EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li, Yangning et al. arXiv 2023. [paper]
- BloombergGPT: A Large Language Model for Finance. Shijie Wu et al. arXiv 2023. [paper]
- Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. Deng, Cheng et al. arXiv 2023. [paper]
- HouYi: An Open-Source Large Language Model Specially Designed for Renewable Energy and Carbon Neutrality Field. Bai, Mingliang et al. arXiv 2023. [paper]
- GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning. Fan, Yaxin et al. arXiv 2023. [paper]
- FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt. Qi, Zhixiao et al. arXiv 2023. [paper]
- ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation. Wen, Cheng et al. arXiv 2023. [paper]
Table: Comparison between the factuality issue and the hallucination issue.
Category | Description |
---|---|
Factual and Non-Hallucinated | Factually correct outputs. |
Non-Factual and Hallucinated | Entirely fabricated outputs. |
Hallucinated but Factual | 1. Outputs that are unfaithful to the prompt but remain factually correct (Cao et al., 2022). 2. Outputs that deviate from the prompt's specifics but don't touch on factuality, e.g., a prompt asking for a story about a rabbit and wolf becoming friends, but the LLM produces a tale about a rabbit and a dog befriending each other. 3. Outputs that provide additional factual details not specified in the prompt, e.g., a prompt asking about the capital of France, and the LLM responds with "Paris, which is known for the Eiffel Tower." |
Non-Factual but Non-Hallucinated | 1. Outputs where the LLM states, "I don't know," or avoids a direct answer. 2. Outputs that are partially correct, e.g., for the question, "Who landed on the moon with Apollo 11?" If the LLM responds with just "Neil Armstrong," the answer is incomplete but not hallucinated. 3. Outputs that provide a generalized or vague response without specific details, e.g., for a question about the causes of World War II, the LLM might respond with "It was due to various political and economic factors." |

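
The comparison above treats factuality and hallucination as two separate judgments, which gives four combinations. A minimal sketch of that mapping, assuming each output has already been labeled by some annotator with two booleans (the names `is_factual`, `is_hallucinated`, and `categorize` are illustrative and do not come from the survey):

```python
def categorize(is_factual: bool, is_hallucinated: bool) -> str:
    """Map two independent judgments to one of the four table categories.

    Both inputs are assumed to be supplied by a human or automatic annotator;
    this function does not decide factuality or hallucination itself.
    """
    if is_factual and not is_hallucinated:
        return "Factual and Non-Hallucinated"
    if not is_factual and is_hallucinated:
        return "Non-Factual and Hallucinated"
    if is_factual and is_hallucinated:
        return "Hallucinated but Factual"
    return "Non-Factual but Non-Hallucinated"


# Example: the answer adds "known for the Eiffel Tower", which the prompt
# did not ask for, yet every statement in it is correct.
print(categorize(is_factual=True, is_hallucinated=True))
```

Two booleans are a simplification: some rows in the table, such as the rabbit-and-dog story, describe outputs that do not make factual claims at all, so a real annotation scheme would likely need more than a yes/no factuality label.
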
Category | Cause | Example Dialog | Notes and references |
---|---|---|---|
Model-level causes | Domain knowledge deficit | Q: CEO of Assicurazioni Generali? BloombergGPT: Philippe Donnet GPT-NeoX: Antonio De Lorenzo, Simone Gambarini, Enrico Zanetti FLAN-T5-XXL: John M Forsyth, Christopher K Peters, {empty string} | BloombergGPT is a finance domain-specific language model. Wu et al, 2023 |
Model-level causes | Outdated information | Q: When was Kyiv attacked by Russia? ChatGPT: As of my last knowledge update in September 2021, Russia had not launched an attack on Kyiv. | Kyiv was attacked by Russia on 25 February 2022. |
Model-level causes | Reasoning error | Q: Who is Tom Cruise’s mother? A: Mary Lee Pfeiffer Q: Who is Mary Lee Pfeiffer’s son? A: There is no widely known information about... | From Berglund et al, 2023. It is clear that the model knows Tom Cruise’s mother is Mary Lee Pfeiffer, but it fails to reason that Mary Lee Pfeiffer has a son named Tom Cruise. |
Retrieval-level causes | Distracted by the retrieval information | Retrieved context: "Alan Melikdjanian - Age, Bio, Faces and Birthday" Alan Melikdjanian is the Director, Writer (with his father), Editor, and Composer of the 2006 direct-to-DVD Russian-language film Citizen Mavzik... Q: What is the nationality of the director of Citizen Mavzik? A: (1) The director of the film Citizen Mavzik is Alan Melikdjanian. (2) Alan Melikdjanian is from Russia. So, the answer is Russia. | Alan Melikdjanian, known by the alias Captain Disillusion, is a Soviet-born American independent filmmaker and YouTuber. Melikdjanian et al, 2018 |
Retrieval-level causes | Misunderstood the retrieval information | Retrieved context: "Gerard III, Count of Guelders - Wikipedia" Gerard married Margaretha of Brabant... Q: Who is Gerard I, Count Of Guelders's wife? A: ... the name of Gerard I, Count of Guelders's wife, which is Margaretha of Brabant. | The retrieved context accurately provided the name of Gerard III, Count of Guelders' wife, which is Margaretha of Brabant. However, the model erroneously referred to him as Gerard I, resulting in an incorrect answer. |
Retrieval-level causes | Fail to address misinformation in retrieved documents | Retrieved context: [0] "James Vincent - Death Notice and Service Information - Legacy.com" James Vincent died on August 9, 2023... [1] "UPDATE: Jury convicts James Vincent of murder after deliberating..." A jury convicted James Vincent of murder in the 2018 Shooting death of Luka Grabb... Q: When did James Vincent die? A: James Vincent died on August 9, 2023, in Long Beach, California. | In the presence of contradictory information from retrieved documents, the model should thoroughly outline the response, while also noting the need for clarity. For instance, the model could state: "According to the first document, James Vincent passed away on August 9, 2023, in Long Beach. However, the second document suggests that James Vincent was murdered in 2018." |
Inference-level causes | Snowballing | Q: Was there ever a US senator who represented the state of New Hampshire and whose alma mater was the University of Pennsylvania? A: Yes... His name was John P. Hale | John P. Hale graduated from Bowdoin College. Zhang et al, 2023 |
Inference-level causes | Exposure bias | Q: Aamari was very stoic. [PRONOUN] rarely showed any emotion. A: He. | The correct answer was Xe according to Hossain et al, 2023. |

Reference | Task | Dataset | Metrics | Human Eval | Evaluated LLMs | Granularity |
---|---|---|---|---|---|---|
FActScore Min et al, 2023 | Biography Generation | 183 people entities | F1 | ✓ | GPT-3.5, ChatGPT... | T |
SelfCheckGPT Manakul et al, 2023 | Bio Generation | WikiBio | AUC-PR, Human Score | ✓ | GPT-3, LLaMA, OPT, GPT-J... | S |
Wang et al, 2023 | Open QA | NQ, TQ | ACC, EM | ✓ | GPT-3.5, ChatGPT, GPT-4, Bing Chat | S |
Pezeshkpour et al, 2023 | Knowledge Probing | T-REx, LAMA | ACC | | GPT3.5 | T |
De Cao et al, 2021 | QA, Fact Checking | KILT, FEVER, zsRE | ACC | | GPT-3, FLAN-T5 | S/T |
Varshney et al, 2023 | Article Generation | Unnamed Dataset | ACC, AUC | | GPT3.5, Vicuna | S |
FacTool Chern et al, 2023 | KB-based QA | RoSE | ACC, F1... | | GPT-4, ChatGPT, FLAN-T5 | S |
Kadavath et al, 2022 | Self-evaluation | BIG Bench, MMLU, LogiQA, TruthfulQA, QuALITY, TriviaQA, Lambada | ACC, Brier Score, RMS Calibration Error... | | Claude | T |

Reference | Task | Dataset | Metrics | Human Eval | Evaluated LLMs | Granularity |
---|---|---|---|---|---|---|
Retro Borgeaud et al, 2022 | QA, Language Modeling | MassiveText, Curation Corpus, Wikitext103, Lambada, C4, Pile, NQ | PPL, ACC, Exact Match | ✓ | Retro | T |
GenRead Yu et al, 2023 | QA, Dialogue, Fact Checking | NQ, TQ, WebQ, FEVER, FM2, WoW | EM, ACC, F1, Rouge-L | - | GPT3.5, Codex, GPT-3, Gopher, FLAN, GLaM, PaLM | S |
GopherCite Menick et al, 2022 | Self-supported QA | NQ, ELI5, TruthfulQA (Health, Law, Fiction, Conspiracies) | Human Score | ✓ | GopherCite | S |
Trivedi et al, 2023 | QA | HotpotQA, IIRC, 2WikiMultihopQA, MuSiQue(music) | Retrieval recall, Answer F1 | - | GPT-3, FLAN-T5 | S/T |
Peng et al, 2023 | QA, Dialogue | DSTC7 track2, DSTC11 track5, OTT-QA | ROUGE, chrF, BERTScore, Usefulness, Humanness... | ✓ | ChatGPT | S/T |
CRITIC Gou et al, 2023 | QA, Toxicity Reduction | AmbigNQ, TriviaQA, HotpotQA, RealToxicityPrompts | Exact Match, maximum toxicity, perplexity, n-gram diversity, AUROC... | - | GPT-3.5, ChatGPT | T |
Khot et al, 2023 | QA, long-context QA | CommaQA-E, 2WikiMultihopQA, MuSiQue, HotpotQA | Exact Match, Answer F1 | - | GPT-3, FLAN-T5 | T |
ReAct Yao et al, 2023 | QA, Fact Verification | HotpotQA, FEVER | Exact Match, ACC | - | PaLM, GPT-3 | S/T |
Jiang et al, 2023 | QA, Commonsense Reasoning, long-form QA... | 2WikiMultihopQA, StrategyQA, ASQA, WikiAsp | Exact Match, Disambig-F1, ROUGE, entity F1... | - | GPT-3.5 | T |
Lee et al, 2022 | Open-ended Generation | FEVER | Entity score, EntailmentRatio, ppl... | - | Megatron-LM | T |
SAIL Luo et al, 2023 | QA, Fact Checking | UniLC | ACC, F1 | - | LLaMA, Vicuna, SAIL | T |
He et al, 2022 | Commonsense Reasoning, Temporal Reasoning, Tabular Reasoning | StrategyQA, TempQuestions, INFOTABS | ACC | - | GPT-3 | T |
Pan et al, 2023 | Fact Checking | HOVER, FEVEROUS-S | Macro-F1 | - | Codex, FLAN-T5 | S |
Multiagent Debate Du et al, 2023 | Biography, MMLU | Unnamed Biography Dataset, MMLU | ChatGPT Evaluator, ACC | - | Bard, ChatGPT | S |

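
Exact Match and answer-level F1 appear as metrics for many of the QA entries above. The sketch below shows one common way these two scores are computed for a single prediction, using SQuAD-style answer normalization. The individual papers may normalize or aggregate differently, so this is an illustration under that assumption and not the scoring code of any listed work. The function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("Neil Armstrong", "Neil Armstrong and Buzz Aldrin"))
```

Exact Match penalizes a partially correct answer such as "Neil Armstrong" as fully wrong, while token-level F1 gives it partial credit, which is one reason several works in the table report both.
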
Reference | Task Type | Dataset | Metrics | Performance of Representative LLMs |
---|---|---|---|---|
MMLU Hendrycks et al, 2021 | Multi-Choice QA | Humanities, Social Sciences, STEM... | ACC | (ACC, 5-shot) GPT-4: 86.4; GPT-3.5: 70; LLaMA2-70B: 68.9 |
TruthfulQA Lin et al, 2022 | QA | Health, Law, Conspiracies, Fiction... | Human Score, GPT-judge, ROUGE, BLEU, MC1, MC2... | (zero-shot) GPT-4: ~29 (MC1); GPT-3.5: ~28 (MC1), 79.92 (%true); LLaMA2-70B: 53.37 (%true) |
C-Eval Huang et al, 2023 | Multi-Choice QA | STEM, Social Science, Humanities... | ACC | (zero-shot, average ACC) GPT-4: 68.7; GPT-3.5: 54.4; LLaMA2-70B: 50.13 |
AGIEval Zhong et al, 2023 | Multi-Choice QA | Gaokao (geometry, Bio, history...), SAT, Law... | ACC | (zero-shot, average ACC) GPT-4: 56.4; GPT-3.5: 42.9; LLaMA2-70B: 40.02 |
HaluEval Li et al, 2023 | Hallucination Evaluation | HaluEval | ACC | (general ACC) GPT-3.5: 86.22 |
BigBench Srivastava et al, 2023 | Multi-tasks (QA, NLI, Fact Checking, Reasoning...) | BigBench | Metric to each type of task | (Big-Bench Hard) GPT-3.5: 49.6; LLaMA-65B: 42.6 |
ALCE Gao et al, 2023 | Citation Generation | ASQA, ELI5, QAMPARI | MAUVE, Exact Match, ROUGE-L... | (ASQA, 3-psg, citation prec) GPT-3.5: 73.9; LLaMA-33B: 23.0 |
QUIP Weller et al, 2023 | Generative QA | TriviaQA, NQ, ELI5, HotpotQA | QUIP-Score, Exact match | (ELI5, QUIP, null prompt) GPT-4: 21.0; GPT-3.5: 27.7 |
PopQA Mallen et al, 2023 | Multi-Choice QA | PopQA, EntityQuestions | ACC | (overall ACC) GPT-3.5: ~37.0 |
UniLC Zhang et al, 2023 | Fact Checking | Climate, Health, MGFN | ACC, F1 | (zero-shot, fact tasks, average F1) GPT-3.5: 51.62 |
Pinocchio Hu et al, 2023 | Fact Checking, QA, Reasoning | Pinocchio | ACC, F1 | GPT-3.5: (Zero-shot ACC: 46.8, F1: 44.4); GPT-3.5: (Few-shot ACC: 47.1, F1: 45.7) |
SelfAware Yin et al, 2023 | Self-evaluation | SelfAware | ACC | (instruction input, F1) GPT-4: 75.47; GPT-3.5: 51.43; LLaMA-65B: 46.89 |
RealTimeQA Kasai et al, 2022 | Multi-Choice QA, Generative QA | RealTimeQA | ACC, F1 | (original setting, GCS retrieval) GPT-3: 69.3 (ACC for MC); GPT-3: 39.4 (F1 for generation) |
FreshQA Vu et al, 2023 | Generative QA | FRESHQA | ACC (Human) | (strict ACC, null prompt) GPT-4: 28.6; GPT-3.5: 26.0 |

Reference | Domain | Task | Datasets | Metrics | Evaluated LLMs |
---|---|---|---|---|---|
Xie et al, 2023 | Finance | Sentiment analysis, News headline classification, Named entity recognition, Question answering, Stock movement prediction | FLARE | F1, Acc, Avg F1, Entity F1, EM, MCC | GPT-4, BloombergGPT, FinMA-(7B, 30B, 7B-full), Vicuna-7B |
Li et al, 2023 | Finance | 134 E-com tasks | EcomInstruct | Micro-F1, Macro-F1, ROUGE | BLOOM, BLOOMZ, ChatGPT, EcomGPT |
Wang et al, 2023 | Medicine | Multi-Choice QA | CMB | Acc | GPT-4, ChatGLM2-6B, ChatGPT, DoctorGLM, Baichuan-13B-chat, HuatuoGPT, MedicalGPT, ChatMed-Consult, ChatGLM-Med, Bentsao, BianQue-2 |
Li et al, 2023 | Medicine | Generative-QA | Huatuo-26M | BLEU, ROUGE, GLEU | T5, GPT2 |
Jin et al, 2023 | Medicine | Nomenclature, Genomic location, Functional analysis, Sequence alignment | GeneTuring | Acc | GPT-2, BioGPT, BioMedLM, GPT-3, ChatGPT, New Bing |
Guha et al, 2023 | Law | Issue-spotting, Rule-recall, Rule-application, Rule-conclusion, Interpretation, Rhetorical-understanding | LegalBench | Acc, EM | GPT-4, GPT-3.5, Claude-1, Incite, OPT, Falcon, LLaMA-2, FLAN-T5... |
Fei et al, 2023 | Law | Legal QA, NER, Sentiment Analysis, Reading Comprehension | LawBench | F1, Acc, ROUGE-L, Normalized log-distance... | GPT-4, ChatGPT, InternLM-Chat, StableBeluga2... |

Reference | Dataset | Metrics | Baselines ➝ Theirs | Dataset | Metrics | Baselines ➝ Theirs |
---|---|---|---|---|---|---|
Li et al, 2022 | NQ | EM | 34.5 ➝ 44.35 (T5 11B) | GSM8K | ACC | 77.0 ➝ 85.0 (ChatGPT) |
Yu et al, 2023 | NQ | EM | 20.9 ➝ 28.0 (InstructGPT) | TriviaQA | EM | 57.5 ➝ 59.0 (InstructGPT) |
- | - | - | - | WebQA | EM | 18.6 ➝ 24.6 (InstructGPT) |
Chuang et al, 2023 | FACTOR News | ACC | 58.3 ➝ 62.0 (LLaMA-7B) | FACTOR News | ACC | 61.1 ➝ 62.5 (LLaMA-13B) |
- | FACTOR News | ACC | 63.8 ➝ 65.4 (LLaMA-33B) | FACTOR News | ACC | 63.6 ➝ 66.2 (LLaMA-65B) |
- | FACTOR Wiki | ACC | 58.6 ➝ 62.2 (LLaMA-7B) | FACTOR Wiki | ACC | 62.6 ➝ 66.2 (LLaMA-13B) |
- | FACTOR Wiki | ACC | 69.5 ➝ 70.3 (LLaMA-33B) | FACTOR Wiki | ACC | 72.2 ➝ 72.4 (LLaMA-65B) |
- | TruthfulQA | %Truth * Info | 32.4 ➝ 44.6 (LLaMA-13B) | TruthfulQA | %Truth * Info | 34.8 ➝ 49.2 (LLaMA-65B) |
Li et al, 2022 | TruthfulQA | %Truth * Info | 32.4 ➝ 44.4 (LLaMA-13B) | TruthfulQA | %Truth * Info | 31.7 ➝ 36.7 (LLaMA-33B) |
- | TruthfulQA | %Truth * Info | 34.8 ➝ 43.4 (LLaMA-65B) | - | - | - |
Li et al, 2023 | NQ | ACC | 46.6 ➝ 51.3 (LLaMA-7B) | TriviaQA | ACC | 89.6 ➝ 91.1 (LLaMA-7B) |
- | MMLU | ACC | 35.7 ➝ 40.1 (LLaMA-7B) | TruthfulQA | %Truth * Info | 32.5 ➝ 65.1 (Alpaca) |
- | TruthfulQA | %Truth * Info | 26.9 ➝ 43.5 (LLaMA-7B) | TruthfulQA | %Truth * Info | 51.5 ➝ 74.0 (Vicuna) |
Cohen et al, 2023 | LAMA | F1 | 50.7 ➝ 80.8 (ChatGPT) | TriviaQA | F1 | 56.2 ➝ 82.6 (ChatGPT) |
- | NQ | F1 | 60.6 ➝ 79.1 (ChatGPT) | PopQA | F1 | 65.2 ➝ 85.4 (ChatGPT) |
- | LAMA | F1 | 42.5 ➝ 79.3 (GPT-3) | TriviaQA | F1 | 46.7 ➝ 77.2 (GPT-3) |
- | NQ | F1 | 52.0 ➝ 78.0 (GPT-3) | PopQA | F1 | 43.7 ➝ 77.4 (GPT-3) |
... |
Reference | Domain | Model | Eval Task | Eval Dataset | Continual Pretrained? | Continual SFT? | Train From Scratch? | External Knowledge |
---|---|---|---|---|---|---|---|---|
Zhang et al, 2023 | Healthcare | Baichuan-7B, Ziya-LLaMA-13B | QA | cMedQA2, WebMedQA, Huatuo-26M | ✔️ | |||
Yang et al, 2023 | Healthcare | Ziya-LLaMA-13B | QA | CMtMedQA, Huatuo-26M | ✔️ | ✔️ | ||
Wang et al, 2023 | Healthcare | GPT-3.5-Turbo, LLaMA-2-13B | QA | MedQA-USMLE, MedQA-MCMLE, MedMCQA | ✔️ | |||
Ross et al, 2022 | Healthcare | MOLFORMER | Molecule properties prediction | ✔️ | ||||
Bao et al, 2023 | Healthcare | Baichuan-13B | QA | CMB-Clin, CMD, CMID | ✔️ | |||
Guan et al, 2023 | Healthcare | ChatGPT | IU-RR, MIMIC-CXR | ✔️ | ||||
Liu et al, 2023 | Healthcare | GPT-4 | Medical Text De-Identification | ✔️ | ||||
Li et al, 2023 | Healthcare | LLaMA | QA | ✔️ | ||||
Venigalla et al, 2022 | Healthcare | GPT (2.7B) | QA | ✔️ | ||||
Xiong et al, 2023 | Healthcare | ChatGLM-6B | QA | ✔️ | ||||
Tan et al, 2023 | Healthcare | Baichuan-7B | QA | C-Eval, MMLU | ✔️ | |||
Luo et al, 2022 | Healthcare | GPT-2 | QA, DC, RE | ✔️ | ||||
Jin et al, 2023 | Healthcare | Codex | QA | GeneTuring | ✔️ | |||
Zakka et al, 2023 | Healthcare | text-davinci-003 | QA | ClinicalQA | ✔️ | |||
Liu et al, 2023 | Healthcare | GPT-2 medium | Molecular Property Prediction, Molecule-text translation | ✔️ | ✔️ | |||
Nguyen et al, 2023 | Law | GPT-3 | ✔️ | |||||
Savelka et al, 2023 | Law | GPT-4 | ✔️ | |||||
Huang et al, 2023 | Law | LLaMA | CN Legal Tasks | ✔️ | ✔️ | |||
Cui et al, 2023 | Law | Ziya-LLaMA-13B | QA | National judicial examination questions | ✔️ | ✔️ | ||
Li et al, 2023 | Finance | BLOOMZ | 4 major tasks 12 subtasks | EcomInstruct | ✔️ | |||
Wu et al, 2023 | Finance | BLOOM | Financial NLP (SA, BC, NER, NER+NED, QA) | Financial Datasets | ✔️ | |||
Deng et al, 2023 | Geoscience | LLaMA-7B | GeoBench | ✔️ | ||||
Bai et al, 2023 | Geoscience | ChatGLM-6B | ✔️ | |||||
Fan et al, 2023 | Education | phoenix-inst-chat-7b | Chinese Grammatical Error Correction | ChatGPT-generated, Human-annotated | ✔️ | |||
Qi et al, 2023 | Food | Chinese-LLaMA2-13B | QA | ✔️ | ✔️ | |||
Wen et al, 2023 | Home Renovation | Baichuan-13B | C-Eval, CMMLU, EvalHome | ✔️ |
If you find this project useful in your research or work, please consider citing it:
@misc{wang2023survey,
  title={Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity},
  author={Cunxiang Wang and Xiaoze Liu and Yuanhao Yue and Xiangru Tang and Tianhang Zhang and Cheng Jiayang and Yunzhi Yao and Wenyang Gao and Xuming Hu and Zehan Qi and Yidong Wang and Linyi Yang and Jindong Wang and Xing Xie and Zheng Zhang and Yue Zhang},
  year={2023},
  eprint={2310.07521},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
- CHEN Liang (ChanLiang) for PR#1.
- JinheonBaek (JinheonBaek) for PR#2 and PR#3.