Large Language Models on Tabular Data -- A Survey

@article{
fang2024large,
title={Large Language Models ({LLM}s) on Tabular Data: Prediction, Generation, and Understanding - A Survey},
author={Xi Fang and Weijie Xu and Fiona Anting Tan and Ziqing Hu and Jiani Zhang and Yanjun Qi and Srinivasan H. Sengamedu and Christos Faloutsos},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2024},
url={https://openreview.net/forum?id=IZnrCGF9WI},
note={}
}

Original paper

LLM on Tabular Data Prediction and Understanding -- A Survey

This repo collects and categorizes papers on large language models for tabular data, following our survey paper, Large Language Models on Tabular Data -- A Survey. Because the field is developing quickly, we will continue to update both the arXiv paper and this repo.

Abstract
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of a comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and dataset references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

Dataset	Number of Datasets	Dataset Repo
OpenML	11	https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning/tree/master/regression/realdata/data
Kaggle API	169	https://github.com/Kaggle/kaggle-api
Combo	9	https://github.com/clinicalml/TabLLM/tree/main/datasets
UCI ML	20	https://github.com/dylan-slack/Tablet/tree/main/data/benchmark/performance
DDX	10	https://github.com/dylan-slack/Tablet/tree/main/data/ddx_data_no_instructions/benchmark

Dataset	# Tables	Task Type	Input	Output	Data Source	Dataset Repo
FeTaQA	10330	QA	Table, Question	Answer	Wikipedia	https://github.com/Yale-LILY/FeTaQA
WikiTableQuestions	2108	QA	Table, Question	Answer	Wikipedia	https://ppasupat.github.io/WikiTableQuestions/
NQ-TABLES	169898	QA	Question, Table	Answer	Synthetic	https://github.com/google-research-datasets/natural-questions
HybriDialogue	13000	QA	Conversation, Table, Reference	Answer	Wikipedia	https://github.com/entitize/HybridDialogue
TAT-QA	2757	QA	Question, Table	Answer	Financial report	https://github.com/NExTplusplus/TAT-QA
HiTab	3597	QA/NLG	Question, Table	Answer	Statistical Report and Wikipedia	https://github.com/microsoft/HiTab
ToTTo	120000	NLG	Table	Sentence	Wikipedia	https://github.com/google-research-datasets/ToTTo
FEVEROUS	28800	Classification	Claim, Table	Label	Common Crawl	https://fever.ai/dataset/feverous.html
Dresden Web Tables	125M	Classification	Table	Label	Common Crawl	https://ppasupat.github.io/WikiTableQuestions/
InfoTabs	2540	NLI	Table, Hypothesis	Label	Wikipedia	https://infotabs.github.io/
TabFact	16573	NLI	Table, Statement	Label	Wikipedia	https://tabfact.github.io/
TAPEX	1500	Text2SQL	SQL, Table	Answer	Synthetic	https://github.com/google-research/tapas
Spider	1020	Text2SQL	Table, Question	SQL	Human annotation	https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0
WikiSQL	24241	Text2SQL	Table, Question	SQL, Answer	Human Annotated	https://github.com/salesforce/WikiSQL
BIRD	12751	Text2SQL	Table, Question	SQL	Human Annotated	https://bird-bench.github.io/
Tapilot-Crossing	5	Text2Code, QA, RAG	Table, Dialog History, Question, Private Lib, Chart	Python, Private Lib Code, Answer	Human-Agent Interaction	https://tapilot-crossing.github.io/
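
The table above is tab-separated, so it can be filtered by task type with a few lines of Python. The sketch below is only an illustration: `TABLE_TSV` holds a small subset of rows copied from the table (with the Dataset Repo column left out for brevity), and `datasets_for_task` is a hypothetical helper name, not something provided by this repo.

```python
import csv
import io

# Illustrative subset of the table above, tab-separated.
# The "Dataset Repo" column is omitted here to keep the example short.
TABLE_TSV = (
    "Dataset\t# Tables\tTask Type\tInput\tOutput\tData Source\n"
    "TAT-QA\t2757\tQA\tQuestion, Table\tAnswer\tFinancial report\n"
    "ToTTo\t120000\tNLG\tTable\tSentence\tWikipedia\n"
    "TabFact\t16573\tNLI\tTable, Statement\tLabel\tWikipedia\n"
    "Spider\t1020\tText2SQL\tTable, Question\tSQL\tHuman annotation\n"
    "Tapilot-Crossing\t5\tText2Code, QA, RAG\t"
    "Table, Dialog History, Question, Private Lib, Chart\t"
    "Python, Private Lib Code, Answer\tHuman-Agent Interaction\n"
)


def datasets_for_task(tsv_text, task):
    """Return names of datasets whose Task Type cell lists `task`.

    A Task Type cell may hold several tasks separated by '/' or ','
    (for example "QA/NLG" or "Text2Code, QA, RAG").
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    names = []
    for row in reader:
        # Normalize both separators to ',' and strip surrounding spaces.
        tasks = [t.strip() for t in row["Task Type"].replace("/", ",").split(",")]
        if task in tasks:
            names.append(row["Dataset"])
    return names


if __name__ == "__main__":
    print(datasets_for_task(TABLE_TSV, "QA"))
    print(datasets_for_task(TABLE_TSV, "Text2SQL"))
```

Matching on the split task list rather than on a substring keeps a query such as "QA" from accidentally matching a longer task name that merely contains those letters.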

tanfiona/LLM-on-Tabular-Data-Prediction-Table-Understanding-Data-Generation

Taxonomy

Prediction task

Tabular Data

Time series

Application Specific

Data Generation task

Table understanding

Numeric Question Answering

Question Answering

Text2SQL

Table2Text

Fact Verification

Table Profiling

Table Transformation

Entity Matching

Datasets

Prediction Tasks

Table Understanding Tasks

Survey

Contributing