/ML4DB-paper-list

Papers for database systems powered by artificial intelligence (machine learning for database)

[Paper List] AI4DB / autonomous database / intelligent database / self-driving database

Paper list for database systems with artificial intelligence (machine learning, deep learning, reinforcement learning)

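A list of papers on applying machine learning, neural networks, reinforcement learning, self-tuning techniques, and related methods in database systems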

PRs and additions are welcome!

System and Tutorial

  • SageDB: A Learned Database System (CIDR 2019)
  • Database Learning: Toward a Database that Becomes Smarter Every Time (SIGMOD 2017)
  • Self-Driving Database Management Systems (CIDR 2017)
  • Self-Driving: From General Purpose to Specialized DBMSs (PhD@PVLDB 2018)
  • Active Learning for ML Enhanced Database Systems (SIGMOD 2020)
  • Database Meets Artificial Intelligence: A Survey (TKDE 2020)
  • Self-driving database systems: a conceptual approach (Distributed and Parallel Databases 2020)
  • One Model to Rule them All: Towards Zero-Shot Learning for Databases (arXiv 2021)
  • UDO: Universal Database Optimization using Reinforcement Learning (arXiv 2021)
  • Towards a Benchmark for Learned Systems (SMDB workshop 2021)
  • A Unified Transferable Model for ML-Enhanced DBMS [Vision] (arXiv 2021)
  • AI Meets Database: AI4DB and DB4AI (SIGMOD 2021)
  • Expand your Training Limits! Generating Training Data for ML-based Data Management (SIGMOD 2021)
  • MB2: Decomposed Behavior Modeling for Self-Driving Database Management Systems (SIGMOD 2021)

Data Access

Configuration Tuning

  • SARD: A statistical approach for ranking database tuning parameters (ICDEW 2008)
  • Regularized Cost-Model Oblivious Database Tuning with Reinforcement Learning (2016)
  • Automatic Database Management System Tuning Through Large-scale Machine Learning (SIGMOD 2017)
  • The Case for Automatic Database Administration using Deep Reinforcement Learning (arXiv 2018)
  • An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning (SIGMOD 2019)
  • External vs. Internal: An Essay on Machine Learning Agents for Autonomous Database Management Systems
  • QTune: A Query-Aware Database Tuning System with Deep Reinforcement Learning (VLDB 2019)
  • Optimizing Databases by Learning Hidden Parameters of Solid State Drives (VLDB 2019)
  • iBTune: Individualized Buffer Tuning for Large-scale Cloud Databases (VLDB 2019)
  • Black or White? How to Develop an AutoTuner for Memory-based Analytics (SIGMOD 2020)
  • Learning Efficient Parameter Server Synchronization Policies for Distributed SGD (ICLR 2020)
  • Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs (HotStorage 2020)
  • Dynamic Configuration Tuning of Working Database Management Systems (LifeTech 2020)
  • Adaptive Multi-Model Reinforcement Learning for Online Database Tuning (EDBT 2021)
  • An inquiry into machine learning-based automatic configuration tuning services on real-world database management systems (VLDB 2021)

Physical Design

Learned structure

  • Stacked Filters: Learning to Filter by Structure (VLDB 2021)
  • LEA: A Learned Encoding Advisor for Column Stores (aiDM 2021)

LSM-tree related

  • Leaper: A Learned Prefetcher for Cache Invalidation in LSM-tree based Storage Engines (VLDB 2020)
  • From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees (OSDI 2020)

Index

Index Structure
  • Learning to hash for indexing big data - A survey (2016)
  • The Case for Learned Index Structures (SIGMOD 2018)
  • A-Tree: A Bounded Approximate Index Structure (2017)
  • FITing-Tree: A Data-aware Index Structure (SIGMOD 2019)
  • Learned Indexes for Dynamic Workloads (2019)
  • SOSD: A Benchmark for Learned Indexes (2019)
  • Learning Multi-dimensional Indexes (2019)
  • ALEX: An Updatable Adaptive Learned Index (SIGMOD 2020)
  • The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries (EDBT 2020)
  • Effectively Learning Spatial Indices (VLDB 2020)
  • Stable Learned Bloom Filters for Data Streams (VLDB 2020)
  • START — Self-Tuning Adaptive Radix Tree (ICDEW 2020)
  • Learned Data Structures (2020)
  • The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds (VLDB 2020)
  • A Tutorial on Learned Multi-dimensional Indexes (SIGSPATIAL 2020)
  • Benchmarking Learned Indexes (VLDB 2020)
  • Why Are Learned Indexes So Effective? (ICML 2020)
  • Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads (VLDB 2020)
  • A Lazy Approach for Efficient Index Learning (2021)
  • Updatable Learned Index with Precise Positions (arXiv 2021)
  • The RLR-Tree: A Reinforcement Learning Based R-Tree for Spatial Data (arXiv 2021)
  • Spatial Interpolation-based Learned Index for Range and kNN Queries (arXiv 2021)
  • APEX: A High-Performance Learned Index on Persistent Memory (arXiv 2021)
  • RUSLI: Real-time Updatable Spline Learned Index (aiDM 2021)
Index Recommendation
  • Index Selection in a Self-Adaptive Data Base Management System (SIGMOD 1976)
  • AutoAdmin 'What-if' Index Analysis Utility (SIGMOD 1998)
  • Self-Tuning Database Systems: A Decade of Progress (VLDB 2007)
  • AI Meets AI: Leveraging Query Executions to Improve Index Recommendations (SIGMOD 2019)
  • Automated Database Indexing using Model-free Reinforcement Learning (2020)
  • DRLindex: deep reinforcement learning index advisor for a cluster database (2020 Symposium on International Database Engineering & Applications)
  • Magic mirror in my hand, which is the best in the land? An Experimental Evaluation of Index Selection Algorithms (VLDB 2020)
  • An Index Advisor Using Deep Reinforcement Learning (CIKM 2020)
  • DBA bandits: Self-driving index tuning under ad-hoc, analytical workloads with safety guarantees (ICDE 2021)

Schema & Partition

Offline
  • Schism: a Workload-Driven Approach to Database Replication and Partitioning (VLDB 2010)
  • Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems (SIGMOD 2012)
  • Automated Data Partitioning for Highly Scalable and Strongly Consistent Transactions (IEEE Transactions on Parallel and Distributed Systems 2016)
  • GridFormation: Towards Self-Driven Online Data Partitioning using Reinforcement Learning (aiDM@SIGMOD 2018)
  • Learning a Partitioning Advisor with Deep Reinforcement Learning (2019)
  • Qd-tree: Learning Data Layouts for Big Data Analytics (SIGMOD 2020)
  • A Genetic Optimization Physical Planner for Big Data Warehouses (2020)
  • Lachesis: Automated Partitioning for UDF-Centric Analytics (VLDB 2021)
  • Instance-Optimized Data Layouts for Cloud Analytics Workloads (SIGMOD 2021)
  • Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning (SIGMOD 2021)
Online
  • Relax and Let the Database Do the Partitioning Online (BIRTE 2011)
  • SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads (EDBT 2013)
  • Online Data Partitioning in Distributed Database Systems (EDBT 2015)
  • A Robust Partitioning Scheme for Ad-Hoc Query Workloads (SoCC 2017)

Workload

Resource Estimation and Auto-scaling

  • Automated Demand-driven Resource Scaling in Relational Database-as-a-Service (SIGMOD 2016)
  • Database Workload Capacity Planning using Time Series Analysis and Machine Learning (SIGMOD 2020)
  • Seagull: An Infrastructure for Load Prediction and Optimized Resource Allocation (VLDB 2020)

Performance Diagnosis and Modeling

  • Performance and resource modeling in highly-concurrent OLTP workloads (SIGMOD 2013)
  • DBSherlock: A Performance Diagnostic Tool for Transactional Databases (SIGMOD 2016)
  • A Top-Down Approach to Achieving Performance Predictability in Database Systems (SIGMOD 2017)
  • Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases (VLDB 2020)
  • Workload-Aware Performance Tuning for Autonomous DBMSs (ICDE 2021)

Workload Shift Detection

  • Towards workload shift detection and prediction for autonomic databases (CIKM 2007)
  • Consistent on-line classification of dbs workload events (CIKM 2009)
  • On predictive modeling for optimizing transaction execution in parallel OLTP systems (VLDB 2011)

Metrics Prediction for Queries

  • PQR: Predicting Query Execution Times for Autonomous Workload Management (ICAC 2008)
  • Predicting multiple metrics for queries: Better decisions enabled by machine learning (ICDE 2009)
  • Learning-based SPARQL query performance modeling and prediction (WWW 2017)

Workload Characterization

  • On Workload Characterization of Relational Database Environments (TSE 1992)
  • Workload Models for Autonomic Database Management Systems (International Conference on Autonomic and Autonomous Systems 2006)
  • Workload characterization and prediction in the cloud: A multiple time series approach (APNOMS 2012)
  • Query-based Workload Forecasting for Self-Driving Database Management Systems (SIGMOD 2018)
  • Database Workload Characterization with Query Plan Encoders (arXiv 2021)

Query Optimization

Query Rewrite

  • Sia: Optimizing Queries using Learned Predicates (SIGMOD 2021)

Cardinality Estimation

  • Are We Ready For Learned Cardinality Estimation? (arXiv 2020)
  • A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation (SIGMOD 2021)
  • LATEST: Learning-Assisted Selectivity Estimation Over Spatio-Textual Streams (ICDE 2021)

Data-based

(kernel density model)

  • Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation (SIGMOD 2015)
  • Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models (VLDB 2017)

(sum-product network)

  • DeepDB: Learn from Data, not from Queries! (VLDB 2020)

(autoregressive model)

  • Deep Unsupervised Cardinality Estimation (VLDB 2019)
  • Multi-Attribute Selectivity Estimation Using Deep Learning (arXiv 2019)
  • Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries (SIGMOD 2020)
  • NeuroCard: One Cardinality Estimator for All Tables (VLDB 2020)
  • Learning to Sample: Counting with Complex Queries (VLDB 2020)

(graphical models)

  • Selectivity estimation using probabilistic models (SIGMOD 2001)
  • Lightweight graphical models for selectivity estimation without independence assumptions (VLDB 2011)
  • Efficiently adapting graphical models for selectivity estimation (VLDB 2013)
  • An Approach Based on Bayesian Networks for Query Selectivity Estimation (DASFAA 2019)
  • FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation (arXiv 2020)
  • Astrid: Accurate Selectivity Estimation for String Predicates using Deep Learning (VLDB 2021)
  • BayesCard: A Unified Bayesian Framework for Cardinality Estimation (2020)
  • Online Sketch-based Query Optimization (2021)
  • LMKG: Learned Models for Cardinality Estimation in Knowledge Graphs (2021)
  • LHist: Towards Learning Multi-dimensional Histogram for Massive Spatial Data (ICDE 2021)

Query-based

  • Adaptive selectivity estimation using query feedback (SIGMOD 1994)
  • Selectivity Estimation in Extensible Databases - A Neural Network Approach (VLDB 1998)
  • Effective query size estimation using neural networks (Applied Intelligence 2002)
  • LEO - DB2's LEarning optimizer (VLDB 2001)
  • A Black-Box Approach to Query Cardinality Estimation (CIDR 2007)
  • Cardinality Estimation Using Neural Networks (2015)
  • Towards a learning optimizer for shared clouds (VLDB 2018)
  • Learning State Representations for Query Optimization with Deep Reinforcement Learning (DEEM@SIGMOD 2018)
  • Learned Cardinalities: Estimating Correlated Joins with Deep Learning (CIDR 2019)
  • Estimating Cardinalities with Deep Sketches (SIGMOD 2019)
  • Selectivity estimation for range predicates using lightweight models (VLDB 2019)
  • (Review) An Empirical Analysis of Deep Learning for Cardinality Estimation (arXiv 2019)
  • Flexible Operator Embeddings via Deep Learning (arXiv 2019)
  • Improved Cardinality Estimation by Learning Queries Containment Rates (EDBT 2020)
  • NN-based Transformation of Any SQL Cardinality Estimator for Handling DISTINCT, AND, OR and NOT (2020)
  • QuickSel: Quick Selectivity Learning with Mixture Models (SIGMOD 2020)
  • Efficiently Approximating Selectivity Functions using Low Overhead Regression Models (VLDB 2020)
  • Flow-Loss: Learning Cardinality Estimates That Matter (arXiv 2020)
  • Learned Cardinality Estimation for Similarity Queries (SIGMOD 2021)

Cost Estimation

Single Query

  • Statistical learning techniques for costing XML queries (VLDB 2005)
  • Predicting multiple metrics for queries: Better decisions enabled by machine learning (ICDE 2009)
  • The Case for Predictive Database Systems: Opportunities and Challenges (CIDR 2011)
  • Learning-based query performance modeling and prediction (ICDE 2012)
  • Robust estimation of resource consumption for SQL queries using statistical techniques (VLDB 2012)
  • Plan-Structured Deep Neural Network Models for Query Performance Prediction (arXiv 2019)
  • An End-to-End Learning-based Cost Estimator (arXiv 2019) (VLDB 2019)
  • Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings (2020)
  • DBMS Fitting: Why should we learn what we already know? (CIDR 2020)
  • A Note On Operator-Level Query Execution Cost Modeling (2020)
  • Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload (arXiv 2021)

Concurrent

  • PQR: Predicting query execution times for autonomous workload management (ICAC 2008)
  • Performance Prediction for Concurrent Database Workloads (SIGMOD 2011)
  • Predicting completion times of batch query workloads using interaction-aware models and simulation (EDBT 2011)
  • Interaction-aware scheduling of report-generation workloads (VLDB 2011) (includes a scheduling policy)
  • Towards predicting query execution time for concurrent and dynamic database workloads (not machine learning) (VLDB 2014)
  • Contender: A Resource Modeling Approach for Concurrent Query Performance Prediction (EDBT 2014)
  • Query Performance Prediction for Concurrent Queries using Graph Embedding (VLDB 2020)

Join Optimization

  • Adaptive Optimization of Very Large Join Queries (SIGMOD 2018) (not machine learning)
  • Deep Reinforcement Learning for Join Order Enumeration (aiDM@SIGMOD 2018)
  • Learning to Optimize Join Queries With Deep Reinforcement Learning (arXiv)
  • Reinforcement Learning with Tree-LSTM for Join Order Selection (ICDE 2020)
  • Research Challenges in Deep Reinforcement Learning-based Join Query Optimization (aiDM 2020)

Query Plan

  • Plan Selection Based on Query Clustering (VLDB 2002)
  • Cost-Based Query Optimization via AI Planning (AAAI 2014)
  • Sampling-Based Query Re-Optimization (SIGMOD 2016)
  • Learning State Representations for Query Optimization with Deep Reinforcement Learning (DEEM@SIGMOD 2018)
  • Towards a Hands-Free Query Optimizer through Deep Learning (CIDR 2019)
  • Neo: A Learned Query Optimizer (VLDB 2019)
  • Bao: Learning to Steer Query Optimizers (2020)
  • ML-based Cross-Platform Query Optimization (ICDE 2020)
  • Learning-based Declarative Query Optimization (2021)
  • Bao: Making Learned Query Optimization Practical (SIGMOD 2021)
  • Microlearner: A fine-grained Learning Optimizer for Big Data Workloads at Microsoft (2021)
  • Steering Query Optimizers: A Practical Take on Big Data Workloads (SIGMOD 2021)

Query Execution

Sort

  • The Case for a Learned Sorting Algorithm (SIGMOD 2020)

Join

  • SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning (VLDB 2018)

Adaptive Query Processing

  • Eddies: Continuously adaptive query processing (SIGMOD 2000)
  • Micro adaptivity in Vectorwise (SIGMOD 2013)
  • Cuttlefish: A Lightweight Primitive for Adaptive Query Processing (2018)

Approximate Query Processing

  • DBEST: Revisiting approximate query processing engines with machine learning models (SIGMOD 2019)
  • LAQP: Learning-based Approximate Query Processing (2020)
  • Approximate Query Processing for Data Exploration using Deep Generative Models (ICDE 2020)
  • ML-AQP: Query-Driven Approximate Query Processing based on Machine Learning (2020)
  • Approximate Query Processing for Group-By Queries based on Conditional Generative Models (2021)
  • Learned Approximate Query Processing: Make it Light, Accurate and Fast (CIDR 2021)

Scheduling

  • Workload management for cloud databases via machine learning (ICDE 2016 WiseDB)
  • A learning-based service for cost and performance management of cloud databases (ICDEW 2017)(short version for WiSeDB)
  • WiSeDB: A Learning-based Workload Management Advisor for Cloud Databases (VLDB 2016)
  • CrocodileDB: Efficient Database Execution through Intelligent Deferment (CIDR 2020)
  • Buffer Pool Aware Query Scheduling via Deep Reinforcement Learning (2020)
  • Polyjuice: High-Performance Transactions via Learned Concurrency Control (arXiv 2021)

(transaction)

  • Scheduling OLTP transactions via learned abort prediction (aiDM@SIGMOD 2019)
  • Scheduling OLTP Transactions via Machine Learning (2019)

SQL Related

  • Query2Vec (arXiv)
  • An End-to-end Neural Natural Language Interface for Databases
  • SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning (arXiv)
  • Facilitating SQL Query Composition and Analysis (arXiv 2020)
  • Natural language to SQL: Where are we today? (VLDB 2020)
  • From Natural Language Processing to Neural Databases (VLDB 2021)
  • BERT Meets Relational DB: Contextual Representations of Relational Databases
  • Natural language to SQL Resource repo