@article{
fang2024large,
title={Large Language Models ({LLM}s) on Tabular Data: Prediction, Generation, and Understanding - A Survey},
author={Xi Fang and Weijie Xu and Fiona Anting Tan and Ziqing Hu and Jiani Zhang and Yanjun Qi and Srinivasan H. Sengamedu and Christos Faloutsos},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2024},
url={https://openreview.net/forum?id=IZnrCGF9WI},
note={}
}
This repo collects and categorizes papers on large language models for tabular data, following our survey paper, "Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey". Because this field is developing quickly, we will continue to update both the arXiv paper and this repo.
Abstract
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and datasets references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.
Figure 1: Overview of LLM on Tabular Data: the paper discusses application of LLM for prediction, data generation, and table understanding tasks.
Figure 4: Key techniques in using LLMs for tabular data. The dotted line indicates steps that are optional.
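Serialization, which several of the papers listed in this repo name in their titles, is one such technique: a table row is rewritten as text so that it can be placed in an LLM prompt. The following is a minimal sketch under stated assumptions, not the method of any particular paper. The template "The <column> is <value>", the function names, and the example row are all hypothetical and chosen only for illustration.

```python
from typing import Mapping


def serialize_row(row: Mapping[str, object], sep: str = ". ") -> str:
    """Turn one table row into sentence-like text: 'The <column> is <value>'."""
    parts = [f"The {column} is {value}" for column, value in row.items()]
    return sep.join(parts) + "."


def build_prompt(row: Mapping[str, object], question: str) -> str:
    """Combine the serialized row and a task question into a single prompt string."""
    return f"{serialize_row(row)}\n{question}\nAnswer:"


# Hypothetical example row and question (not taken from any dataset in this repo).
example_row = {"age": 42, "occupation": "teacher", "hours per week": 38}
prompt = build_prompt(example_row, "Does this person earn more than 50K a year? Yes or no?")
print(prompt)
```

The resulting string would then be sent to whichever LLM is being evaluated; that call is omitted here because it depends on the model and API in use.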
Table of contents:
TABLET: Learning From Instructions For Tabular Data [code]
Language models are weak learners
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks [code]
TabLLM: Few-shot Classification of Tabular Data with Large Language Models [code]
UniPredict: Large Language Models are Universal Tabular Classifiers
Towards Foundation Models for Learning on Tabular Data
Towards Better Serialization of Tabular Data for Few-shot Classification with Large Language Models
Multimodal Clinical Pseudo-notes for Emergency Department Prediction Tasks using Multiple Embedding Model for EHR (MEME) [code]
StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science
Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [model]
Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law
PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting
Large Language Models Are Zero-Shot Time Series Forecasters
TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [code]
MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement [code]
CPLLM: Clinical Prediction with Large Language Models [code]
CTRL: Connect Collaborative and Language Model for CTR Prediction
FinGPT: Open-Source Financial Large Language Models [code]
Language Models are Realistic Tabular Data Generators [code]
REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
Generative Table Pre-training Empowers Models for Tabular Prediction [code]
TabuLa: Harnessing Language Models for Tabular Data Synthesis [code]
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in ultra low-data regimes
TabMT: Generating tabular data with masked transformers
Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
Graph-to-Text Generation with Dynamic Structure Pruning
Plan-then-Seam: Towards Efficient Table-to-Text Generation
Differentially Private Tabular Data Synthesis using Large Language Models
Pythia: Unsupervised Generation of Ambiguous Textual Claims from Relational Data
TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning [code]
PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance [code]
Large Language Models are few(1)-shot Table Reasoners [code]
cTBLS: Augmenting Large Language Models with Conversational Tables [code]
Large Language Models are Complex Table Parsers
Rethinking Tabular Data Understanding with Large Language Models [code]
Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
Unified Language Representation for Question Answering over Text, Tables, and Images
SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models [code]
TableLlama: Towards Open Large Generalist Models for Tables [code]
StructGPT: A General Framework for Large Language Model to Reason over Structured Data [code]
JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization
CABINET: Content Relevance-based Noise Reduction for Table Question Answering [code]
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [code]
Querying Large Language Models with SQL
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [code]
C3: Zero-shot Text-to-SQL with ChatGPT [code]
DBCopilot: Scaling Natural Language Querying to Massive Databases [code]
Bridging the Gap: Deciphering Tabular Data Using Large Language Model
TableQuery: Querying tabular data with natural language [code]
S2SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers
Dynamic hybrid relation network for cross-domain context-dependent semantic parsing
STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing
SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers
Towards Generalizable and Robust Text-to-SQL Parsing
Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation [code]
Robust (Controlled) Table-to-Text Generation with Structure-Aware Equivariance Learning [code]
Table-based Fact Verification with Salience-aware Learning [code]
Cocoon: Semantic Table Profiling Using Large Language Models [code]
Relationalizing Tables with Large Language Models: The Promise and Challenges
Disambiguate Entity Matching using Large Language Models through Relation Discovery [code]
Please refer to our paper to see relevant methods that benchmark on these datasets.
Dataset | Number of Datasets | Dataset Repo |
---|---|---|
OpenML | 11 | https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning/tree/master/regression/realdata/data |
Kaggle API | 169 | https://github.com/Kaggle/kaggle-api |
Combo | 9 | https://github.com/clinicalml/TabLLM/tree/main/datasets |
UCI ML | 20 | https://github.com/dylan-slack/Tablet/tree/main/data/benchmark/performance |
DDX | 10 | https://github.com/dylan-slack/Tablet/tree/main/data/ddx_data_no_instructions/benchmark |
Dataset | # Tables | Task Type | Input | Output | Data Source | Dataset Repo |
---|---|---|---|---|---|---|
FeTaQA | 10330 | QA | Table, Question | Answer | Wikipedia | https://github.com/Yale-LILY/FeTaQA |
WikiTableQuestions | 2108 | QA | Table, Question | Answer | Wikipedia | https://ppasupat.github.io/WikiTableQuestions/ |
NQ-TABLES | 169898 | QA | Question, Table | Answer | Synthetic | https://github.com/google-research-datasets/natural-questions |
HybriDialogue | 13000 | QA | Conversation, Table, Reference | Answer | Wikipedia | https://github.com/entitize/HybridDialogue |
TAT-QA | 2757 | QA | Question, Table | Answer | Financial report | https://github.com/NExTplusplus/TAT-QA |
HiTab | 3597 | QA/NLG | Question, Table | Answer | Statistical Report and Wikipedia | https://github.com/microsoft/HiTab |
ToTTo | 120000 | NLG | Table | Sentence | Wikipedia | https://github.com/google-research-datasets/ToTTo |
FEVEROUS | 28800 | Classification | Claim, Table | Label | Common Crawl | https://fever.ai/dataset/feverous.html |
Dresden Web Tables | 125M | Classification | Table | Label | Common Crawl | https://ppasupat.github.io/WikiTableQuestions/ |
InfoTabs | 2540 | NLI | Table, Hypothesis | Label | Wikipedia | https://infotabs.github.io/ |
TabFact | 16573 | NLI | Table, Statement | Label | Wikipedia | https://tabfact.github.io/ |
TAPEX | 1500 | Text2SQL | SQL, Table | Answer | Synthetic | https://github.com/google-research/tapas |
Spider | 1020 | Text2SQL | Table, Question | SQL | Human annotation | https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0 |
WikiSQL | 24241 | Text2SQL | Table, Question | SQL, Answer | Human annotation | https://github.com/salesforce/WikiSQL |
BIRD | 12751 | Text2SQL | Table, Question | SQL | Human annotation | https://bird-bench.github.io/ |
Tapilot-Crossing | 5 | Text2Code, QA, RAG | Table, Dialog History, Question, Private Lib, Chart | Python, Private Lib Code, Answer | Human-Agent Interaction | https://tapilot-crossing.github.io/ |
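To make the Input and Output columns concrete for the Text2SQL rows, here is a small sketch of what one example could look like and how a predicted query might be checked by executing it. It uses only Python's standard `sqlite3` module. The table, question, and both SQL strings are hypothetical and written for illustration; none of them come from Spider, WikiSQL, or BIRD, and the comparison shown is one simple way to check a prediction, not the official metric of any benchmark.

```python
import sqlite3

# Hypothetical in-memory table standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Ana", "France", 31), ("Bo", "Japan", 45), ("Cy", "France", 27)],
)

# Input: a table (above) and a natural-language question.
question = "How many singers are from France?"

# Output: a SQL query. In a real pipeline an LLM would generate predicted_sql;
# it is hard-coded here so the sketch runs without any model.
predicted_sql = "SELECT COUNT(*) FROM singer WHERE country = 'France'"
gold_sql = "SELECT COUNT(name) FROM singer WHERE country = 'France'"


def execute(sql: str) -> list:
    """Run a query against the in-memory database and return all rows."""
    return conn.execute(sql).fetchall()


def execution_match(pred: str, gold: str) -> bool:
    """Compare two queries by their result rows, ignoring row order."""
    return sorted(execute(pred)) == sorted(execute(gold))


answer = execute(predicted_sql)[0][0]
print(question, "->", answer)
print("execution match:", execution_match(predicted_sql, gold_sql))
```

Comparing results instead of SQL strings lets two differently written but equivalent queries count as a match, which is why the two queries above differ in form.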
A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions
If you would like to contribute to this list or writeup, feel free to submit a pull request!