Large Language Models on Tabular Data -- A Survey

@misc{fang2024large,
      title={Large Language Models on Tabular Data -- A Survey}, 
      author={Xi Fang and Weijie Xu and Fiona Anting Tan and Jiani Zhang and Ziqing Hu and Yanjun Qi and Scott Nickleach and Diego Socolinsky and Srinivasan Sengamedu and Christos Faloutsos},
      year={2024},
      eprint={2402.17944},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Original paper

LLM on Tabular Data Prediction and Understanding -- A Survey

This repo collects and categorizes papers about large language models on tabular data, following our survey paper, "Large Language Models on Tabular Data -- A Survey". Because this field is developing quickly, we will keep updating both the arXiv paper and this repo.

Abstract
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of a comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and dataset references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

Figure 1: Overview of LLMs on tabular data: the paper discusses the application of LLMs to prediction, data generation, and table understanding tasks.

Figure 4: Key techniques in using LLMs for tabular data. The dotted line indicates steps that are optional.

Table of contents:

Taxonomy
Datasets

Taxonomy

Prediction task

Table understanding

Numeric Question Answering

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Question Answering

Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning [code]

PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance [code]

Large Language Models are few(1)-shot Table Reasoners [code]

cTBLS: Augmenting Large Language Models with Conversational Tables [code]

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Large Language Models are Complex Table Parsers

Rethinking Tabular Data Understanding with Large Language Models [code]

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks

Unified Language Representation for Question Answering over Text, Tables, and Images

TableLlama: Towards Open Large Generalist Models for Tables [code]

DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text

StructGPT: A General Framework for Large Language Model to Reason over Structured Data [code]

JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization

Text2SQL

Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation

DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [code]

C3: Zero-shot Text-to-SQL with ChatGPT [code]

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

TableQuery: Querying tabular data with natural language [code]

Datasets

Please refer to our paper to see relevant methods that benchmark on these datasets.

Prediction Tasks

Dataset	# Datasets	Dataset Repo
OpenML	11	https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning/tree/master/regression/realdata/data
Kaggle API	169	https://github.com/Kaggle/kaggle-api
Combo	9	https://github.com/clinicalml/TabLLM/tree/main/datasets
UCI ML	20	https://github.com/dylan-slack/Tablet/tree/main/data/benchmark/performance
DDX	10	https://github.com/dylan-slack/Tablet/tree/main/data/ddx_data_no_instructions/benchmark

Table Understanding Tasks

Dataset	# Tables	Task Type	Input	Output	Data Source	Dataset Repo
FetaQA	10330	QA	Table, Question	Answer	Wikipedia	https://github.com/Yale-LILY/FeTaQA
WikiTableQuestion	2108	QA	Table, Question	Answer	Wikipedia	https://ppasupat.github.io/WikiTableQuestions/
NQ-TABLES	169898	QA	Question, Table	Answer	Synthetic	https://github.com/google-research-datasets/natural-questions
HybriDialogue	13000	QA	Conversation, Table, Reference	Answer	Wikipedia	https://github.com/entitize/HybridDialogue
TAT-QA	2757	QA	Question, Table	Answer	Financial report	https://github.com/NExTplusplus/TAT-QA
HiTAB	3597	QA/NLG	Question, Table	Answer	Statistical Report and Wikipedia	https://github.com/microsoft/HiTab
ToTTo	120000	NLG	Table	Sentence	Wikipedia	https://github.com/google-research-datasets/ToTTo
FEVEROUS	28800	Classification	Claim, Table	Label	Common Crawl	https://fever.ai/dataset/feverous.html
Dresden Web Tables	125M	Classification	Table	Label	Common Crawl	https://ppasupat.github.io/WikiTableQuestions/
InfoTabs	2540	NLI	Table, Hypothesis	Label	Wikipedia	https://infotabs.github.io/
TabFact	16573	NLI	Table, Statement	Label	Wikipedia	https://tabfact.github.io/
TAPEX	1500	Text2SQL	SQL, Table	Answer	Synthetic	https://github.com/google-research/tapas
Spider	1020	Text2SQL	Table, Question	SQL	Human annotation	https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0
WIKISQL	24241	Text2SQL	Table, Question	SQL, Answer	Human annotation	https://github.com/salesforce/WikiSQL
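
To make the Input and Output columns of the Text2SQL rows concrete, here is a minimal sketch of the task shape: a table and a natural-language question go in, a SQL query comes out, and executing that query gives the answer. The `singer` table, the question, the query, and the helper `execute_sql` are all hypothetical and invented for illustration; they are not taken from Spider, WikiSQL, or any other dataset listed above, and no model is called here.

```python
import sqlite3

# Hypothetical example, not drawn from any listed dataset.
question = "How many singers are older than 30?"
# Stand-in for the SQL string a text-to-SQL model would be expected to produce.
predicted_sql = "SELECT COUNT(*) FROM singer WHERE age > 30"


def execute_sql(rows, sql):
    """Load rows into an in-memory `singer` table and run `sql` against it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
        conn.executemany("INSERT INTO singer VALUES (?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


rows = [("Ana", 25), ("Ben", 41), ("Chloe", 33)]
answer = execute_sql(rows, predicted_sql)
print(question, "->", answer)
```

Comparing the executed result against a reference answer, rather than comparing SQL strings, is one common way to score such predictions, since different queries can return the same rows.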

Contributing

If you would like to contribute to this list or writeup, feel free to submit a pull request!

george1459/LLM-on-Tabular-Data-Prediction-Table-Understanding-Data-Generation