Large Language Models on Tabular Data -- A Survey

@article{
fang2024large,
title={Large Language Models ({LLM}s) on Tabular Data: Prediction, Generation, and Understanding - A Survey},
author={Xi Fang and Weijie Xu and Fiona Anting Tan and Ziqing Hu and Jiani Zhang and Yanjun Qi and Srinivasan H. Sengamedu and Christos Faloutsos},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2024},
url={https://openreview.net/forum?id=IZnrCGF9WI},
note={}
}

Original paper

LLM on Tabular Data Prediction and Understanding -- A Survey

This repo collects and categorizes papers on large language models for tabular data, following our survey paper, Large Language Models on Tabular Data -- A Survey. Given how quickly this field is developing, we will continue to update both the arXiv paper and this repo.

Abstract
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently no comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides references to relevant code and datasets. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

Figure 1: Overview of LLMs on tabular data. The paper discusses the application of LLMs to prediction, data generation, and table understanding tasks.

Figure 4: Key techniques in using LLMs for tabular data. The dotted line indicates steps that are optional.

Table of contents:

Taxonomy

Prediction task


Tabular Data

TABLET: Learning From Instructions For Tabular Data [code]

Language models are weak learners

LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
[code]

TabLLM: Few-shot Classification of Tabular Data with Large Language Models
[code]

UniPredict: Large Language Models are Universal Tabular Classifiers

Towards Foundation Models for Learning on Tabular Data

Towards Better Serialization of Tabular Data for Few-shot Classification with Large Language Models

Multimodal clinical pseudo-notes for emergency department prediction tasks using multiple embedding model for EHR (MEME) [code]

Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning

StructLM: Towards Building Generalist Models for Structured Knowledge Grounding

UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [model]

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Time series

LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law

PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting

Large Language Models Are Zero-Shot Time Series Forecasters

TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models
[code]

Application Specific

MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement
[code]

CPLLM: Clinical Prediction with Large Language Models
[code]

SERVAL: Synergy Learning between Vertical Models and LLMs towards Oracle-Level Zero-shot Medical Prediction

CTRL: Connect Collaborative and Language Model for CTR Prediction

FinGPT: Open-Source Financial Large Language Models
[code]

Data Generation task


Language Models are Realistic Tabular Data Generators [code]

REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers

Generative Table Pre-training Empowers Models for Tabular Prediction [code]

TabuLa: Harnessing Language Models for Tabular Data Synthesis [code]

Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in ultra low-data regimes

TabMT: Generating tabular data with masked transformers

Elephants Never Forget: Testing Language Models for Memorization of Tabular Data

Graph-to-Text Generation with Dynamic Structure Pruning

Plan-then-Seam: Towards Efficient Table-to-Text Generation

Differentially Private Tabular Data Synthesis using Large Language Models

Pythia: Unsupervised Generation of Ambiguous Textual Claims from Relational Data

Table understanding


Numeric Question Answering

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Question Answering

Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning [code]

PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance [code]

Large Language Models are few(1)-shot Table Reasoners [code]

cTBLS: Augmenting Large Language Models with Conversational Tables [code]

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Large Language Models are Complex Table Parsers

Rethinking Tabular Data Understanding with Large Language Models [code]

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks

Unified Language Representation for Question Answering over Text, Tables, and Images

SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models [code]

TableLlama: Towards Open Large Generalist Models for Tables [code]

DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text

StructGPT: A General Framework for Large Language Model to Reason over Structured Data [code]

JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization

CABINET: Content Relevance-based Noise Reduction for Table Question Answering [code]

Traffic Performance GPT (TP-GPT): Real-Time Data Informed Intelligent ChatBot for Transportation Surveillance and Management

Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [code]

Querying Large Language Models with SQL

Text2SQL

Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation

DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [code]

C3: Zero-shot Text-to-SQL with ChatGPT [code]

DBCopilot: Scaling Natural Language Querying to Massive Databases [code]

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

TableQuery: Querying tabular data with natural language [code]

S2SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers

Dynamic hybrid relation network for cross-domain context-dependent semantic parsing

STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing

SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers

Towards Generalizable and Robust Text-to-SQL Parsing

Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation [code]

Table2Text

Robust (Controlled) Table-to-Text Generation with Structure-Aware Equivariance Learning [code]

Fact Verification

Table-based Fact Verification with Salience-aware Learning [code]

Table Profiling

Cocoon: Semantic Table Profiling Using Large Language Models [code]

Table Transformation

Relationalizing Tables with Large Language Models: The Promise and Challenges

Entity Matching

Disambiguate Entity Matching using Large Language Models through Relation Discovery [code]

Datasets

Please refer to our paper to see relevant methods that benchmark on these datasets.

Prediction Tasks

| Dataset | Dataset Number | Dataset Repo |
| --- | --- | --- |
| OpenML | 11 | https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning/tree/master/regression/realdata/data |
| Kaggle API | 169 | https://github.com/Kaggle/kaggle-api |
| Combo | 9 | https://github.com/clinicalml/TabLLM/tree/main/datasets |
| UCI ML | 20 | https://github.com/dylan-slack/Tablet/tree/main/data/benchmark/performance |
| DDX | 10 | https://github.com/dylan-slack/Tablet/tree/main/data/ddx_data_no_instructions/benchmark |

Table Understanding Tasks

| Dataset | # Tables | Task Type | Input | Output | Data Source | Dataset Repo |
| --- | --- | --- | --- | --- | --- | --- |
| FeTaQA | 10330 | QA | Table, Question | Answer | Wikipedia | https://github.com/Yale-LILY/FeTaQA |
| WikiTableQuestions | 2108 | QA | Table, Question | Answer | Wikipedia | https://ppasupat.github.io/WikiTableQuestions/ |
| NQ-TABLES | 169898 | QA | Question, Table | Answer | Synthetic | https://github.com/google-research-datasets/natural-questions |
| HybriDialogue | 13000 | QA | Conversation, Table, Reference | Answer | Wikipedia | https://github.com/entitize/HybridDialogue |
| TAT-QA | 2757 | QA | Question, Table | Answer | Financial report | https://github.com/NExTplusplus/TAT-QA |
| HiTab | 3597 | QA/NLG | Question, Table | Answer | Statistical Report and Wikipedia | https://github.com/microsoft/HiTab |
| ToTTo | 120000 | NLG | Table | Sentence | Wikipedia | https://github.com/google-research-datasets/ToTTo |
| FEVEROUS | 28800 | Classification | Claim, Table | Label | Common Crawl | https://fever.ai/dataset/feverous.html |
| Dresden Web Tables | 125M | Classification | Table | Label | Common Crawl | https://ppasupat.github.io/WikiTableQuestions/ |
| InfoTabs | 2540 | NLI | Table, Hypothesis | Label | Wikipedia | https://infotabs.github.io/ |
| TabFact | 16573 | NLI | Table, Statement | Label | Wikipedia | https://tabfact.github.io/ |
| TAPEX | 1500 | Text2SQL | SQL, Table | Answer | Synthetic | https://github.com/google-research/tapas |
| Spider | 1020 | Text2SQL | Table, Question | SQL | Human Annotated | https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0 |
| WikiSQL | 24241 | Text2SQL | Table, Question | SQL, Answer | Human Annotated | https://github.com/salesforce/WikiSQL |
| BIRD | 12751 | Text2SQL | Table, Question | SQL | Human Annotated | https://bird-bench.github.io/ |
| Tapilot-Crossing | 5 | Text2Code, QA, RAG | Table, Dialog History, Question, Private Lib, Chart | Python, Private Lib Code, Answer | Human-Agent Interaction | https://tapilot-crossing.github.io/ |

Survey

A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions

Contributing

If you would like to contribute to this list or writeup, feel free to submit a pull request!